
VIVEKANANDHA COLLEGE OF ARTS AND SCIENCES FOR WOMEN
(Autonomous)

Course Material

Department: Department of Computer Science and Applications
Programme: M.Sc(CS)
Course Title: Natural Language Processing
Course Code:
Class & Section: I M.Sc(CS)
Semester & Academic Year: II Sem, 2024-25
Handling Staff: Dr.P.Sumitra
Designation: Associate Professor

Staff Incharge HoD Principal



Unit-I

UNIT I INTRODUCTION: Origins and challenges of NLP – Language Modeling: Grammar-based LM, Statistical LM – Regular Expressions, Finite-State Automata – English Morphology, Transducers for lexicon and rules, Tokenization, Detecting and Correcting Spelling Errors, Minimum Edit Distance

Introduction: Origins and Challenges of NLP


What is NLP?
Natural Language Processing (NLP) is a subfield of Artificial Intelligence (AI) that focuses on
enabling computers to understand, interpret, and generate human language in a meaningful way. It
bridges the gap between human communication and machine understanding, making it possible for
machines to process text and speech data effectively.

Origins of NLP
1. Historical Roots:
o NLP has its origins in linguistics, computer science, and AI.
o Early work in the 1950s, such as the Turing Test (Alan Turing, 1950), laid the foundation for
machines understanding language.
2. Early Milestones:
o 1950s-1960s: Rule-based systems and symbolic approaches dominated the field.
 Example: Noam Chomsky's work on transformational grammar provided insights into the structure
of language.
o 1966: The ALPAC report criticized the progress of machine translation and shifted focus from overly
ambitious goals to practical methods.
o 1980s: Statistical methods gained prominence with the availability of computational power and large
datasets.
o 2000s onwards: The rise of machine learning, and later deep learning, revolutionized NLP.
3. Modern Era:
o With the introduction of word embeddings (e.g., Word2Vec) and transformer-based models like
BERT, NLP has become highly effective in tasks like machine translation, sentiment analysis, and
question answering.

Key Challenges in NLP


Despite significant advances, NLP faces several challenges due to the complexity and
variability of human language:
1. Ambiguity in Language
 Lexical Ambiguity: Words can have multiple meanings.
o Example: The word "bank" can mean a financial institution or the edge of a river.
 Syntactic Ambiguity: Different structures can yield different interpretations.
o Example: "I saw the man with a telescope" (Who has the telescope?).
 Semantic Ambiguity: Understanding context-specific meaning is difficult.
2. Context Understanding
 Human language depends heavily on context, including cultural, historical, and situational nuances.
o Example: "It's cold in here." (Could imply turning on the heater or wearing a jacket).
3. Resource Scarcity

 Many languages, especially low-resource languages, lack sufficient labeled data or corpora for
effective NLP model training.
4. Multilingual Processing
 Developing systems that understand and process multiple languages effectively is a significant
challenge due to:
o Differences in grammar, syntax, and script.
o Lack of parallel datasets for translation.
5. Idiomatic Expressions
 Idioms and phrases often have meanings that cannot be deduced from the individual words.
o Example: "Kick the bucket" means "to die."
6. Domain-Specific Knowledge
 Language varies across domains, such as medicine, law, or finance, requiring tailored models to
understand domain-specific terms and jargon.
7. Ethical and Bias Issues
 NLP systems can inherit biases present in the training data, leading to unfair or unethical outcomes.
o Example: Gender bias in machine translations ("He is a doctor" vs. "She is a nurse").
8. Real-Time Processing
 Building real-time NLP systems, such as live transcription or translation tools, requires balancing
accuracy, speed, and computational resources.

Language Modeling: Grammar-based LM


A Grammar-based Language Model (LM) in Natural Language Processing (NLP) refers to a
type of language model that relies on a set of predefined grammatical rules or structures to generate or
interpret language. Unlike statistical models that rely on large amounts of data to predict word
sequences based on probabilities, grammar-based models focus on syntactic structures and linguistic
rules to define how words should be arranged in a language.
Key Concepts:
1. Grammar Rules: These models use formal grammar rules, often based on context-free grammar
(CFG), to define sentence structures. These rules can specify how phrases and sentences should be
constructed and how words relate to one another syntactically.
2. Syntax-Driven Generation: A grammar-based LM typically generates sentences by applying a series
of grammatical rules that define how sentence constituents (like nouns, verbs, and adjectives) can
combine to form valid structures.
3. Syntax Parsing: Grammar-based models can be used in syntactic parsing, where the goal is to
analyze the grammatical structure of a sentence based on predefined rules. The output is often a parse
tree that shows how the sentence is structured according to these grammar rules.
4. Types of Grammar: Common forms of grammar used in these models include:
o Context-Free Grammar (CFG): The most widely used in NLP, where production rules map non-
terminal symbols to terminal symbols (words).
o Dependency Grammar: Focuses on the dependencies between words, rather than on hierarchical
phrase structures.
o Lexicalized Grammar: A grammar that includes lexical information (specific word forms) in the
rules.
Example:
 Grammar Rule (Context-Free Grammar):

o Sentence → NounPhrase VerbPhrase
o NounPhrase → Determiner Noun
o VerbPhrase → Verb NounPhrase

 Sentence Generation:
o With the above rules, you could generate sentences like "The cat sleeps" by recursively applying these
rules.
Strengths and Limitations:
Strengths:
 Provides clear structure and interpretability.
 Allows for exact control over the syntactic structure of language.
 Useful in applications where syntactic correctness is crucial, such as grammar checking or syntactic
parsing.
Limitations:
 Often does not perform as well on complex or ambiguous language constructs compared to
probabilistic models (like n-gram models or neural LMs).
 Limited ability to capture semantic meaning or context beyond syntax.
 Requires a detailed set of rules, which can be difficult and time-consuming to define for a large and
diverse language.
Use Cases:
 Parsing: Syntax trees and grammatical structure analysis.
 Speech Recognition: Ensuring grammatically correct sequences of words.
 Machine Translation: Ensuring the output of translations adheres to the grammar of the target
language.
In modern NLP, grammar-based models have been largely supplanted by data-driven and neural
network-based models due to their ability to generalize better and handle a wider range of language
phenomena. However, grammar-based approaches remain useful for specific applications, such as
syntactic parsing or tasks requiring precise control over sentence structure.

Statistical LM
In natural language processing (NLP), a statistical language model (LM) is a model that uses
statistical methods to predict the probability of a sequence of words. These models are based on the
assumption that the occurrence of each word in a text depends probabilistically on the previous
words, and this dependency is typically captured in a mathematical framework.
Here’s a breakdown of what a statistical language model in NLP involves:
Key Concepts:
1. Probability Distribution:
o The core idea behind a statistical language model is that it assigns a probability to a sequence of
words, such as a sentence or phrase. For example, the model would calculate the probability of a
sentence like "The cat sat on the mat."
2. N-grams:
o One common approach in statistical language modeling is the n-gram model, which focuses on the conditional probability of a word based on the preceding n−1 words. For example, in a bigram (2-gram) model, the probability of a word depends on the previous word.

P(w_n | w_1, w_2, ..., w_{n-1}) ≈ P(w_n | w_{n-1})

This means the model predicts the probability of the n-th word based only on the (n−1)-th word.

3. Training the Model:


o A statistical language model is trained using a large corpus of text. During training, the model learns
the probability distributions of word sequences, usually by counting how often sequences of words
appear in the corpus.
4. Smoothing:
o Statistical models like n-grams can face issues when encountering unseen word sequences (e.g., if a
particular combination of words has never been observed in the training data). Smoothing techniques
such as Additive Smoothing (Laplace smoothing) are used to assign non-zero probabilities to
unseen word sequences.
5. Application:
o Speech Recognition: Statistical language models help predict the most likely sequence of words in
spoken language.
o Machine Translation: They are used to improve translation quality by selecting the most likely
sequence of words in the target language.
o Text Generation: They assist in generating human-like text by predicting the most probable next
word in a sequence.
Examples of Statistical Models:
1. Unigram Model:
o A unigram model assumes that each word is independent of the others and calculates the probability
of each word in isolation.
o P(w_1, w_2, ..., w_n) = P(w_1) P(w_2) ... P(w_n)
o This is a very simplistic model but can be useful in some scenarios.
2. Bigram Model:
o A bigram model considers the previous word when predicting the next word.
o P(w_1, w_2, ..., w_n) = P(w_1) P(w_2 | w_1) P(w_3 | w_2) ... P(w_n | w_{n-1})
3. Trigram Model:
o A trigram model looks at the two previous words when predicting the next one.
o P(w_1, w_2, ..., w_n) = P(w_1) P(w_2 | w_1) P(w_3 | w_1, w_2) ... P(w_n | w_{n-2}, w_{n-1})
(A short sketch of estimating such probabilities from counts follows below.)
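
As a rough illustration of how these probabilities are estimated in practice, the following sketch computes maximum-likelihood bigram probabilities from raw counts on a tiny toy corpus (defined in the code, not a standard dataset) and uses them to score a sentence:

from collections import Counter

# Toy corpus; <s> and </s> mark sentence boundaries (an assumption for illustration)
corpus = [
    "<s> the cat sat on the mat </s>",
    "<s> the cat ran </s>",
    "<s> a dog sat on the mat </s>",
]

unigram_counts = Counter()
bigram_counts = Counter()
for sent in corpus:
    words = sent.split()
    unigram_counts.update(words)
    bigram_counts.update(zip(words, words[1:]))

def bigram_prob(prev, word):
    # Maximum-likelihood estimate: P(word | prev) = Count(prev, word) / Count(prev)
    return bigram_counts[(prev, word)] / unigram_counts[prev]

def sentence_prob(sentence):
    # Chain-rule product of bigram probabilities, using <s> and </s> as boundaries
    words = ["<s>"] + sentence.split() + ["</s>"]
    p = 1.0
    for prev, word in zip(words, words[1:]):
        p *= bigram_prob(prev, word)
    return p

print(sentence_prob("the cat sat on the mat"))   # ≈ 0.083 for this toy corpus
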
Limitations:
 Data Sparsity: Higher-order n-grams (like trigrams and beyond) tend to suffer from sparse data
issues where many possible word sequences are never seen during training.
 Contextual Understanding: Statistical models often fail to capture long-range dependencies or the
deeper semantic meaning of sentences.
Modern Alternatives:
 While statistical language models were once dominant, neural network-based models like RNNs
(Recurrent Neural Networks) and Transformers have largely replaced them in recent years due to
their ability to model complex patterns and long-range dependencies more effectively.

 However, statistical LMs are still important in understanding the evolution of language modeling in
NLP.

Regular Expressions
Regular expression (RE): a language for specifying text search strings. Regular expressions are particularly useful for searching in texts, when we have a pattern to search for and a corpus of texts to search through.
○ In an information retrieval (IR) system such as a Web search engine, the texts might be entire documents or Web pages
○ In a word processor, the texts might be individual words, or lines of a document
● grep command in Linux
○ grep ‘nlp’ /path/file
The simplest kind of regular expression is a sequence of simple characters
● For example, to search for language, we type /language/
● The search string can consist of a single character (like /!/) or a sequence of characters (like /urgl/)
Regular expressions are case-sensitive; lower case /s/ is distinct from upper case /S/
Square brackets can be used:
● The string of characters inside the brackets specifies a disjunction of characters to match
● /[lL]anguage/ matches language or Language
● The regular expression /[1234567890]/ specifies any single digit
Ranges in []: If there is a well-defined sequence associated with a set of characters, a dash (-) in brackets can specify any one character in a range
● /[A-Z]/ an upper case letter, e.g. matching the ‘D’ and ‘B’ in “we should call it ‘Drenched Blossoms’ ”
Negations in []:
● The square brackets can also be used to specify what a single character cannot be, using the caret ^
● If the caret ^ is the first symbol after the open square bracket [, the resulting pattern is negated
● [^A-Z] Not an upper case letter
● [^a-z] Not a lower case letter
● [^Ss] Neither ‘S’ nor ‘s’
If E1 and E2 are regular expressions, then E1 | E2 is a regular expression (disjunction)
● woodchuck|groundhog → woodchuck or groundhog
● a|b|c → a, b or c
Kleene * (closure) operator: The Kleene star means “zero or more occurrences of the immediately previous regular expression”
● Kleene + (positive closure) operator: The Kleene plus means “one or more occurrences of the immediately preceding regular expression”

Regular Expression | Matches
ba*     | b, ba, baa, baaa, ...
ba+     | ba, baa, baaa, ...
(ba)*   | ε (empty string), ba, baba, bababa, ...
(ba)+   | ba, baba, bababa, ...
(b|a)+  | b, a, bb, ba, aa, ab, ...

A wildcard expression (dot) . matches any single character (except a carriage return)
○ beg.n → begin, begun, begxn, …
○ a.*b → any string that starts with a and ends with b
● The question mark ? marks optionality of the previous expression
○ woodchucks? → woodchuck or woodchucks
○ colou?r → color or colour
○ (a|b)?c → ac, bc, c
● {m,n} causes the resulting RE to match from m to n repetitions of the preceding RE
● {m} specifies that exactly m copies of the previous RE should be matched
○ (ba){2,3} → baba, bababa
The order of RE operator precedence, from highest precedence to lowest precedence, is as follows:
● Parentheses ()
● Counters * + ? {}
● Sequences and anchors ^ $
● Disjunction |
● The regular expression the* matches theeeee but not thethe
● The regular expression (the)* matches thethe but not theeeee
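
The patterns above can be tried directly with Python's re module; a small illustration (the test strings are arbitrary examples, not from the text above):

import re

text = "My colour printer prints color pages; woodchuck and woodchucks live nearby."

# Optional character with ?
print(re.findall(r"colou?r", text))         # ['colour', 'color']
print(re.findall(r"woodchucks?", text))     # ['woodchuck', 'woodchucks']

# Character classes and ranges
print(re.findall(r"[A-Z]\w+", text))        # words starting with an upper case letter -> ['My']
print(re.findall(r"[0-9]", "Room 42"))      # ['4', '2']

# Kleene star and plus
print(re.findall(r"ba+", "b ba baa baaa"))  # ['ba', 'baa', 'baaa']
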

Finite State Automata

Any regular expression can be realized as a finite state automaton (FSA). There are two kinds of FSAs:
● Deterministic Finite State Automata (DFAs)
● Non-deterministic Finite State Automata (NFAs)

Finite State Automata (FSA) are a crucial concept in Natural Language Processing (NLP) and
formal language theory. They are computational models used to recognize and process patterns in
input sequences, such as strings of text, in a way that can be implemented efficiently.
What is a Finite State Automaton (FSA)?
A Finite State Automaton (FSA) consists of:
1. States: A finite set of states, including one starting state and possibly one or more accepting (final)
states.
2. Alphabet: A finite set of symbols (or characters) that the automaton reads as input.
3. Transitions: A set of rules that dictate how the automaton moves between states based on the input
symbols.
4. Start State: The initial state from where the automaton begins its operation.
5. Accepting States: States where the automaton halts and accepts the input as valid.
FSAs can be deterministic (DFA) or non-deterministic (NFA), depending on whether there is
more than one possible state transition for a given input from a specific state.
Application of FSAs in NLP
1. Morphological Analysis:
o FSAs are used in morphological parsing to analyze the structure of words and identify their
morphemes (smallest units of meaning). For instance, they can model how affixes like suffixes or
prefixes modify root words (e.g., "run" + "ing" = "running").
2. Lexical Analysis:
o In lexical analysis, FSAs can be used to tokenize text, breaking it into meaningful units like words,
numbers, punctuation marks, etc. This is especially useful in languages with complex word
boundaries or where tokens can change form depending on their context.
3. Finite State Transducers (FSTs):
o An extension of FSAs, Finite State Transducers are used for tasks like morphological generation and
machine translation. They map input strings to output strings, making them useful for tasks that
require the transformation of text (e.g., converting a word into its plural form or verb tense).

4. Part-of-Speech Tagging:
o FSAs can be employed in simpler part-of-speech (POS) tagging tasks, where the transitions between
states represent transitions between different POS tags in a sequence of words.
5. Text Normalization:
o FSAs can be used in text normalization tasks, where variations in spelling, abbreviations, and other
forms of informal language are transformed into a standardized form (e.g., converting "wanna" to
"want to").
6. Speech Recognition:
o FSAs are also widely used in speech recognition systems to model and recognize phonemes or
syllables, which are then mapped to words and sentences.
Example: Simple FSA for Word Recognition
Let’s consider a simple FSA that recognizes the word "cat."
 States: S0 (start), S1, S2, S3 (accepting state)
 Alphabet: {c, a, t}
 Transitions:
o S0 → S1 on 'c'
o S1 → S2 on 'a'
o S2 → S3 on 't' (S3 is the accepting state)
This automaton accepts the word "cat" by transitioning through the states in order and halting in the accepting state S3; any other string is rejected.
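
A minimal way to simulate this DFA in code (the state names and transition table are just the ones from the example above):

# Transition table for the "cat" DFA: states S0 (start), S1, S2, S3 (accepting)
transitions = {
    ("S0", "c"): "S1",
    ("S1", "a"): "S2",
    ("S2", "t"): "S3",
}
accepting = {"S3"}

def accepts(word):
    state = "S0"
    for ch in word:
        state = transitions.get((state, ch))
        if state is None:          # no transition defined for this input: reject
            return False
    return state in accepting

print(accepts("cat"))   # True
print(accepts("cap"))   # False
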
Advantages of FSAs in NLP:
 Simplicity: They are easy to implement and understand.
 Efficiency: FSAs can be processed quickly, making them suitable for applications that require fast
pattern recognition.
 Deterministic Nature: DFAs, in particular, can provide guarantees of linear-time processing for
some tasks, making them highly efficient.
Limitations:
 Limited Expressiveness: FSAs are not suitable for handling more complex linguistic phenomena,
such as context-free or context-sensitive grammar rules. They cannot model certain dependencies
between distant parts of a sentence, such as nested structures (e.g., recursive constructions).
 Memory Usage: For more complex tasks, FSAs may require a large number of states or transitions,
which could make them inefficient or difficult to manage.
In summary, Finite State Automata are fundamental to many NLP tasks, especially for simpler
syntactic and morphological processing. However, for more complex linguistic phenomena, other
models like context-free grammars or deep learning approaches may be more appropriate.

English Morphology
English morphology in Natural Language Processing (NLP) refers to the study of the structure
and formation of words. It involves analyzing how words are built from smaller meaningful units
called morphemes. In NLP, understanding morphology is crucial for tasks like word tokenization,
part-of-speech (POS) tagging, and named entity recognition, among others.
Key Concepts in English Morphology
1. Morpheme: The smallest unit of meaning in a language. There are two main types of morphemes:
o Free Morphemes: These can stand alone as words (e.g., "book", "run").

o Bound Morphemes: These cannot stand alone and must attach to free morphemes to convey meaning
(e.g., prefixes like "un-" in "undo", suffixes like "-ing" in "running").
2. Inflectional Morphology: The modification of a word to express grammatical features such as tense,
number, person, case, or gender.
o Verb Conjugation: Modifying a verb to indicate tense or aspect (e.g., "run" → "runs", "running",
"ran").
o Noun Plurals: Modifying a noun to indicate plurality (e.g., "cat" → "cats").
o Adjective Comparatives and Superlatives: (e.g., "big" → "bigger", "biggest").
3. Derivational Morphology: The process of creating new words by adding prefixes or suffixes to base
words, changing their meaning or part of speech.
o Examples: "happy" → "unhappy" (negation with prefix "un-"), "run" → "runner" (changing verb to
noun with suffix "-er").
4. Compounding: The process of combining two or more words to create a new word (e.g., "tooth" +
"paste" → "toothpaste").
Key NLP Tasks Related to English Morphology
1. Morphological Analysis:
o Involves identifying the morphemes within a word, which is essential for tasks like stemming and
lemmatization.
o Example: "better" → "good" (lemmatization), "running" → "run" (stemming).
2. Stemming:
o The process of reducing a word to its root form by removing suffixes (and sometimes prefixes), often
with little regard for whether the root form is a valid word.
o Example: "running" → "run", "better" → "better" (no change).
o Common algorithms: Porter Stemmer, Lancaster Stemmer, Snowball Stemmer.
3. Lemmatization:
o A more sophisticated process than stemming, lemmatization reduces a word to its base form (lemma),
taking into account the context and part of speech (e.g., "better" → "good", "running" → "run").
o Lemmatization requires more linguistic knowledge than stemming, as it considers word meanings.
4. Part-of-Speech (POS) Tagging:
o POS tagging assigns each word in a sentence its corresponding part of speech (e.g., noun, verb,
adjective). Morphological analysis helps identify the appropriate tag based on inflectional forms.
o Example: In the sentence "She is running," "running" is tagged as a verb, while in "He has running
shoes," it is tagged as an adjective.
5. Morphological Generation:
o This involves creating words from their base forms, useful in tasks like machine translation or text
generation where correct word forms are needed based on context.
o Example: Generating the correct verb tense or plural form based on sentence context.
Morphological Resources in NLP
1. Lexical Databases:
o WordNet: A large lexical database of English where words are grouped into sets of synonyms
(synsets), and relationships between words (such as hyponymy, meronymy, and antonymy) are
defined.
o Useful for lemmatization and understanding word meanings, especially in tasks like word sense
disambiguation.

2. Morphological Analyzers:

o Tools like SpaCy or NLTK have built-in morphological analyzers that can parse and lemmatize
words, identifying their root forms and parts of speech.
o They rely on large corpora of annotated language data and rules to analyze inflections.
3. Finite State Transducers (FSTs):
o FSTs are often used in NLP systems for morphological analysis and generation. They are particularly
good at handling languages with complex morphology by modeling the transformations between
word forms (e.g., "run" → "running").
Challenges in English Morphology
1. Irregular Forms:
o English has many irregular forms that don't follow simple rules of inflection. For example, the past
tense of "go" is "went", and the plural of "child" is "children". Handling these exceptions requires
more sophisticated methods than standard stemming or rule-based systems.
2. Ambiguity:
o Some words can be different parts of speech or have different meanings based on context (e.g., "run"
as a noun and verb). Morphological analysis needs to disambiguate based on context, which can be
challenging.
3. Compound Words:
o English allows the creation of new words by combining existing words (e.g., "notebook",
"sunflower"). Handling compound word analysis and breaking them into their component morphemes
is a difficult task.
4. Loanwords and Variants:
o English borrows words from other languages, which may have different morphological rules. Proper
handling of these in NLP systems requires cross-lingual or language-independent approaches.
Example of Morphology in Action: Lemmatization
Let’s see how morphological analysis works in lemmatization:
 "Cats" → "cat" (remove plural suffix -s)
 "Running" → "run" (remove progressive suffix -ing)
 "Better" → "good" (irregular adjective comparison)

Transducers for lexicon and rules


A Finite State Transducer (FST) is an extension of a Finite State Automaton (FSA). While
an FSA accepts or rejects strings, an FST maps one string to another. In NLP, FSTs are particularly
useful for modeling linguistic transformations like morphological analysis and generation, where the
input string (e.g., a word form) is mapped to an output string (e.g., its lemma, plural form, or past
tense).
Key Characteristics of FSTs:
1. States: FSTs consist of a finite set of states, one of which is the start state, and some are accepting
states.
2. Input and Output Alphabets: FSTs have two alphabets: an input alphabet (the set of symbols that
the transducer reads) and an output alphabet (the set of symbols the transducer produces).
3. Transitions: The transitions between states are labeled with pairs of symbols (input, output). When
the automaton is in a state and reads an input symbol, it transitions to another state and produces an
output symbol.
4. Determinism: FSTs can be either deterministic (DFSTs) or non-deterministic (NFSTs). Deterministic
FSTs are simpler and more efficient, as they have one transition for each input symbol from any given
state, while non-deterministic FSTs may have multiple transitions for a single input symbol.

Example: Morphological Analysis Using FST


Consider the task of stemming the word "running" to its root form "run". An FST for this task
might include states and transitions that map the input form "running" to the output form "run", as
follows:
 States: S0 (start), S1 (after reading the stem "run"), S2 (final accepting state)
 Input alphabet: {r, u, n, i, g} (the letters occurring in "running")
 Transitions:
o S0 → S1 on input "run" → output "run"
o S1 → S2 on input "ning" → output ε (nothing)
o S2 is the accepting state; the accumulated output is "run" (the root form)
This FST works by copying the stem "run" to the output, consuming the suffix "ning" while emitting nothing, and thus returning the root form "run".
Types of NLP Tasks Handled by FSTs:
1. Morphological Analysis: FSTs can be used to decompose words into their morphemes (e.g.,
"running" → "run" + "ing").
2. Morphological Generation: FSTs can also generate word forms by adding appropriate morphemes
to a root word (e.g., "run" → "running").
3. Spelling Correction: FSTs can model spelling correction rules where common misspellings are
mapped to correct forms.
4. Part-of-Speech Tagging: FSTs can be used for tagging words with their respective parts of speech,
considering the morphological structure.
5. Machine Translation: FSTs can be used for translating between languages at the word or morpheme
level, particularly in rule-based translation systems.
Lexicons in NLP
In NLP, a lexicon is a collection of words and their associated meanings, forms, and other
properties like part-of-speech, tense, or number. Lexicons are often coupled with rules for handling
different linguistic phenomena such as word formation, verb conjugation, and noun pluralization.
FSTs and lexicons work together to handle tasks like morphological analysis and generation.
Lexicon-Rule Interaction in NLP:
1. Lexical Entries: The lexicon contains a list of words, along with information about their possible
inflections, derivations, and grammatical features.
o For example, an entry for the word "run" might include its base form and possible derivations like
"running", "runs", and "ran".
2. Lexicon and Morphological Rules: Rules for conjugation, declension, and other morphological
phenomena are applied to lexicon entries. These rules can be encoded in FSTs and work by applying
transformations to the base forms of words.
o Example: A rule could specify that the verb "run" becomes "running" in the present continuous tense
by adding the suffix "-ing".
Example: Lexicon + FST for Conjugation
Let’s consider the verb "to walk" in English and its conjugation in different tenses.
1. Lexicon for "walk":
o Base form: "walk"
o Past tense: "walked"
o Progressive form: "walking"
2. Morphological Rule: Add "-ed" to form the past tense of regular verbs, and add "-ing" to form the
progressive tense.
3. FST for Transformation:
o Past tense path: S0 → S1 on "walk" → output "walk"; S1 → S2 on "-ed" → cumulative output "walked"
o Progressive path: S0 → S1 on "walk" → output "walk"; S1 → S2 on "-ing" → cumulative output "walking"
4. The FST can be used to generate different forms of the verb based on the given rules from the
lexicon.
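
The lexicon-plus-rules idea can be sketched with an ordinary dictionary and suffix rules standing in for the transducer (a simplification; a real FST toolkit would encode the same mapping as states and transitions):

# Lexicon: base forms plus any irregular forms that the rules cannot derive
lexicon = {
    "walk": {},                                     # regular verb: suffix rules apply
    "run":  {"past": "ran", "prog": "running"},     # irregular past, doubled consonant
}

def conjugate(verb, feature):
    """Generate a verb form from the lexicon entry plus regular suffix rules."""
    irregular = lexicon.get(verb, {})
    if feature in irregular:            # lexical exception overrides the rule
        return irregular[feature]
    if feature == "past":
        return verb + "ed"              # rule: add -ed for past tense
    if feature == "prog":
        return verb + "ing"             # rule: add -ing for progressive
    return verb

print(conjugate("walk", "past"))   # walked
print(conjugate("walk", "prog"))   # walking
print(conjugate("run", "past"))    # ran (from the lexicon, not the rule)
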
Benefits of Using FSTs in NLP:
1. Efficiency: FSTs are computationally efficient, with linear-time processing for many tasks. This is
particularly useful when dealing with large datasets, such as corpora of text.
2. Formalism: FSTs offer a formal way to describe language rules, making it easier to reason about and
implement transformations in NLP systems.
3. Modularity: Lexicons and rules can be added or modified independently, making systems more
flexible and easier to maintain.
Challenges:
1. Complexity with Irregular Forms: English (and other languages) has many irregular forms (e.g.,
"go" → "went", "child" → "children") that may not be easily handled by simple FSTs. Special rules
or additional models may be required.
2. State Explosion: When handling languages with rich morphology, the number of possible states and
transitions in an FST can grow quickly, leading to scalability issues.
3. Handling Ambiguity: Some linguistic phenomena (such as homophones or polysemy) may be
challenging to handle with purely rule-based systems like FSTs.

Tokenization
Tokenization in Natural Language Processing (NLP) is the process of breaking down text into
smaller units called tokens. Tokens can be words, subwords, characters, or even sentences, depending
on the specific application. Tokenization is one of the most fundamental steps in NLP pipelines, as it
prepares raw text for further processing.
Types of Tokenization
1. Word-Level Tokenization
o Splits text into individual words.
o Example:
Input: *"Tokenization is important."
_________________________________________________________________________________
______
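
A quick word-level tokenization sketch using NLTK, with plain whitespace splitting shown for contrast (the sentence is the example above):

import nltk
from nltk.tokenize import word_tokenize

nltk.download('punkt')   # tokenizer models

text = "Tokenization is important."
print(text.split())            # ['Tokenization', 'is', 'important.']  (punctuation stays attached)
print(word_tokenize(text))     # ['Tokenization', 'is', 'important', '.']
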

Detecting and correcting spelling errors


Detecting and correcting spelling errors in NLP is essential for improving the quality and
accuracy of text data. This process, often referred to as spell checking or spell correction, involves
identifying misspelled words and replacing them with their correct forms.
Steps for Spelling Error Detection and Correction
1. Detection
Identify words that are not valid or do not exist in the dictionary or language model.
2. Candidate Generation
Generate a list of possible correct spellings for the detected misspelled word. Techniques include:
o Edit Distance: Words with a small number of edits (insertion, deletion, substitution, or transposition)
from the misspelled word.

o Phonetic Matching: Use phonetic algorithms like Soundex or Metaphone to find words that sound
similar.
o Confusion Sets: Predefined sets of commonly confused words (e.g., "their" vs. "there").
3. Ranking Candidates
Rank the possible corrections based on their likelihood of being correct. Common approaches:
o Language Models: Use statistical or neural language models to find the most probable word in
context.
o Frequency: Prefer words that appear more frequently in the corpus.
o Contextual Similarity: Evaluate how well the candidate fits in the sentence context.
4. Correction
Replace the misspelled word with the highest-ranked candidate.
Techniques for Spell Checking
1. Rule-Based Methods
Use predefined rules and dictionaries to detect and correct errors.
Example: Correcting "teh" to "the".
2. Statistical Approaches
o Use n-grams to model the likelihood of sequences of words or characters.
o Noisy Channel Model: Treat spelling errors as a result of "noisy" transmission and try to decode the
intended word.
3. Machine Learning
o Train models on large datasets of text with labeled spelling errors and corrections.
o Context-sensitive spell checkers like neural language models (e.g., BERT, GPT) can better handle
errors in the context of a sentence.
4. Neural Network Approaches
o Use sequence-to-sequence models or transformers to detect and correct spelling errors.
o Example: Fine-tune models like T5, GPT, or BERT for error correction tasks.
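
The candidate-generation and ranking steps can be sketched with edit-distance-1 candidates filtered by a small dictionary and ranked by word frequency, in the spirit of Norvig's classic spell corrector (the dictionary and counts below are toy values for illustration):

import string

# Toy dictionary with rough frequency counts (illustrative values only)
word_freq = {"the": 500, "than": 40, "then": 60, "ten": 20}

def edits1(word):
    """All strings one edit away: deletions, transpositions, substitutions, insertions."""
    letters = string.ascii_lowercase
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes    = [L + R[1:] for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
    replaces   = [L + c + R[1:] for L, R in splits if R for c in letters]
    inserts    = [L + c + R for L, R in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

def correct(word):
    if word in word_freq:                                        # detection: known word
        return word
    candidates = [w for w in edits1(word) if w in word_freq]     # candidate generation
    return max(candidates, key=word_freq.get, default=word)      # ranking by frequency

print(correct("teh"))   # 'the' (recovered via a transposition edit)
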

Minimum Edit Distance


The Minimum Edit Distance in Natural Language Processing (NLP) refers to the smallest
number of operations required to convert one string into another. This concept is widely used in text
processing tasks such as spell checking, plagiarism detection, DNA sequence alignment, and machine
translation.
Edit Operations
The minimum edit distance is computed based on three primary operations:
1. Insertion: Add a character.
2. Deletion: Remove a character.
3. Substitution: Replace one character with another.
Each operation is typically assigned a cost, and the goal is to minimize the total cost. By default, each operation has a cost of 1 (but these costs can vary depending on the application).
1. Using dynamic programming, a table dp is filled in, where dp[i][j] is the minimum edit distance between the first i characters of the source string and the first j characters of the target string:
dp[i][j] = min( dp[i-1][j] + 1 (deletion), dp[i][j-1] + 1 (insertion), dp[i-1][j-1] + cost (substitution; cost = 0 if the characters are the same, else 1) )
2. The value at dp[m][n] (bottom-right cell) gives the minimum edit distance, where m and n are the lengths of the strings.
Example
To convert "kitten" to "sitting":
1. Deletion: Remove 'k' from "kitten".

2. Substitution: Replace 'e' with 'i'.


3. Insertion: Add 'g' at the end.
Operations = 3
Minimum Edit Distance = 3
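
A standard dynamic-programming implementation of the recurrence above, with unit costs for insertion, deletion, and substitution:

def min_edit_distance(source, target):
    m, n = len(source), len(target)
    # dp[i][j] = minimum edits to turn source[:i] into target[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i                        # delete all i characters
    for j in range(n + 1):
        dp[0][j] = j                        # insert all j characters
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if source[i - 1] == target[j - 1] else 1
            dp[i][j] = min(
                dp[i - 1][j] + 1,           # deletion
                dp[i][j - 1] + 1,           # insertion
                dp[i - 1][j - 1] + cost,    # substitution (or match)
            )
    return dp[m][n]

print(min_edit_distance("kitten", "sitting"))   # 3
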
Applications
 Spell correction: Find the closest word from a dictionary.
 Plagiarism detection: Compute similarity between texts.
 Machine Translation: Compare generated translations with reference translations.
 Bioinformatics: Compare DNA, RNA, or protein sequences.

End of Unit-I

Unit-II

UNIT II WORD LEVEL ANALYSIS: Unsmoothed N-grams, Evaluating N-grams, Smoothing,


Interpolation and Backoff – Word Classes, Part-of-Speech Tagging, Rule-based, Stochastic and
Transformation-based tagging, Issues in PoS tagging – Hidden Markov and Maximum Entropy
models.
Word Level Analysis : Unsmoothed N-grams
In Natural Language Processing (NLP), unsmoothed N-grams refer to sequences of n words
analyzed without applying any smoothing techniques to handle zero probabilities for unseen word
sequences. This raw approach can provide insights into the exact frequency and co-occurrence of
words in a text corpus, making it foundational for many tasks in NLP.
Key Concepts in Unsmoothed N-grams
1. N-grams:
o An N-gram is a contiguous sequence of n words from a given text.
 Unigram: n = 1 (single words).
 Bigram: n = 2 (word pairs).
 Trigram: n = 3 (three-word sequences).
 And so on.
2. Unsmoothed probabilities:
o In unsmoothed N-grams, the probabilities of word sequences are calculated directly from their observed frequencies in the corpus:
P(w_1, w_2, ..., w_n) = Count(w_1, w_2, ..., w_n) / Total count of N-grams in the corpus
o If a sequence does not occur in the training data, its probability is zero.
3. Challenges:
o Data Sparsity: Many word sequences may not appear in the training corpus, resulting in zero
probabilities.
o Unseen Events: The model cannot generalize to unseen sequences, limiting its applicability in real-
world scenarios.
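
The data-sparsity problem is easy to demonstrate: with raw counts, any bigram that never occurred in training receives probability zero, so any sentence containing it is scored zero (toy corpus defined in the code):

from collections import Counter

corpus = ["the cat sat", "the dog sat", "the cat ran"]

unigrams = Counter()
bigrams = Counter()
for sent in corpus:
    words = sent.split()
    unigrams.update(words)
    bigrams.update(zip(words, words[1:]))

def p_unsmoothed(prev, word):
    # Raw relative frequency; no probability mass reserved for unseen bigrams
    return bigrams[(prev, word)] / unigrams[prev] if unigrams[prev] else 0.0

print(p_unsmoothed("the", "cat"))   # 0.666... – seen bigram
print(p_unsmoothed("the", "ran"))   # 0.0 – unseen bigram, so the whole sentence scores 0
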
Applications of Unsmoothed N-grams
1. Language Modeling:
o N-grams are used to model the probability distribution of word sequences, forming the basis of many
simple language models.
2. Text Analysis:
o Analyze word frequency, co-occurrence, and patterns in text without adjusting probabilities for rare
events.
3. Machine Translation and Summarization:
o Evaluate the structure of phrases and sentences in a corpus.
4. Information Retrieval:
o Identify key phrases or sequences in a text to match user queries.
Advantages of Unsmoothed N-grams
1. Simplicity:
o Easy to calculate and understand.
2. Exact Representation:
o Provides raw insights into word distribution and sequence occurrence.
3. Interpretability:

o No additional assumptions (e.g., smoothing) obscure the results.

Limitations
1. Zero Probabilities:
o Any sequence not in the training data has a probability of zero, making the model unusable in those
cases.
2. Data Dependency:
o Requires large corpora to cover a wide range of word sequences.
3. Scalability:
o As n increases, the number of possible N-grams grows exponentially, leading to the "curse of
dimensionality."
When to Use Unsmoothed N-grams?
1. Exploratory Analysis:
o When analyzing raw word-level patterns in text data.
2. Baseline Models:
o For establishing a benchmark before applying advanced techniques like smoothing or neural models.
3. Specific NLP Tasks:
o When zero probabilities are not critical, or the focus is on observed patterns only.

Evaluating N-grams Smoothing

Smoothing techniques address the problem of zero probabilities in N-gram models by redistributing
some probability mass from seen N-grams to unseen ones. Evaluating the performance of smoothing
methods is crucial to assess how well a language model generalizes to unseen data and avoids
overfitting.
1. Why Smoothing Is Important in N-gram Models
 Unseen N-grams: Without smoothing, any unseen N-gram will have a probability of zero, making
the model unusable for predicting sequences containing those N-grams.
 Better Generalization: Smoothing ensures the model can handle rare or unseen word sequences
effectively.
 Improved Perplexity: By redistributing probability mass, smoothing generally leads to lower
perplexity on test data, indicating better predictions.
2. Common Smoothing Techniques
1. Laplace (Add-One) Smoothing:
o Adds 1 to the count of every possible N-gram to avoid zero probabilities.
o Formula (bigram case): P(w_n | w_{n-1}) = (Count(w_{n-1}, w_n) + 1) / (Count(w_{n-1}) + V), where V is the vocabulary size. (A small code sketch of this appears after this list.)
2. Add-k Smoothing:
o Generalizes Laplace smoothing by adding a smaller constant k > 0 instead of 1.
o Reduces the overestimation of probabilities for unseen N-grams compared to Laplace smoothing.
3. Good-Turing Smoothing:
o Adjusts the probability of seen and unseen N-grams based on the counts of N-grams with similar
frequencies.
o Effective for redistributing probability mass to unseen events.
4. Kneser-Ney Smoothing:

o Combines absolute discounting with backing off to lower-order models.


o Specifically designed for language modeling and is often the most effective for N-grams.
o Captures the diversity of contexts where a word appears.
5. Backoff and Interpolation:
o Backoff: Uses lower-order N-grams when higher-order N-grams are unavailable.
o Interpolation: Combines probabilities from higher- and lower-order N-grams.
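
A minimal sketch of add-one (Laplace) smoothing for bigram probabilities, continuing the toy-corpus style used earlier (counts and vocabulary are defined in the code):

from collections import Counter

corpus = ["the cat sat", "the dog sat", "the cat ran"]

unigrams = Counter()
bigrams = Counter()
for sent in corpus:
    words = sent.split()
    unigrams.update(words)
    bigrams.update(zip(words, words[1:]))

V = len(unigrams)   # vocabulary size

def p_laplace(prev, word):
    # Add-one smoothing: every bigram count is incremented by 1,
    # and the denominator grows by V to keep probabilities normalized.
    return (bigrams[(prev, word)] + 1) / (unigrams[prev] + V)

print(p_laplace("the", "cat"))   # 0.375 – seen bigram, still relatively high
print(p_laplace("the", "ran"))   # 0.125 – unseen bigram, small but non-zero
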
3. Metrics for Evaluating Smoothing Techniques
3.1 Perplexity
 Measures how well the model predicts a test dataset.
 Lower perplexity indicates better predictions.
 Use: Compare perplexity across different smoothing methods to determine the most effective one.
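
Perplexity can be computed from the model's per-word log probabilities on held-out text; a short sketch, assuming a conditional probability function p(prev, word) such as the smoothed one above:

import math

def perplexity(test_sentences, p):
    """Perplexity = exp of the average negative log probability per predicted word."""
    log_prob_sum = 0.0
    word_count = 0
    for sent in test_sentences:
        words = sent.split()
        for prev, word in zip(words, words[1:]):
            log_prob_sum += math.log(p(prev, word))
            word_count += 1
    return math.exp(-log_prob_sum / word_count)

# Example usage (assumes p_laplace from the earlier sketch is in scope):
# print(perplexity(["the cat sat", "the dog ran"], p_laplace))
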
3.2 Coverage
 Measures how many N-grams in the test set have non-zero probabilities after smoothing.
 Higher coverage: Indicates that the smoothing method successfully handles unseen N-grams.
3.3 Precision and Recall
 Evaluate how accurately the smoothed model predicts N-grams compared to reference sequences.
 Use: Helpful in tasks like machine translation or text generation.
3.4 BLEU/ROUGE Scores
 Evaluate the impact of smoothing on downstream tasks like machine translation (BLEU) or
summarization (ROUGE).
 Higher scores: Indicate that the smoothing method improves the quality of generated text.
4. Practical Considerations
1. Dataset Size:
o Smaller datasets often require more aggressive smoothing techniques like Laplace or Add-k.
o Larger datasets can benefit from advanced methods like Kneser-Ney smoothing.
2. Vocabulary Size:
o Larger vocabularies increase the number of unseen N-grams, making effective smoothing essential.
3. Higher-Order N-grams:
o Higher n (e.g., trigrams, 4-grams) suffer more from data sparsity, making advanced smoothing
methods critical.
4. Task-Specific Requirements:
o Some tasks (e.g., ASR, machine translation) may benefit more from sophisticated smoothing
techniques like Kneser-Ney due to their contextual sensitivity.
5. Interpreting Results
 Perplexity: Use test data to compare perplexity scores across smoothing methods.
 Probabilities: Compare how each method redistributes probability mass to unseen or rare N-grams.
 Task Performance: Evaluate BLEU/ROUGE scores or other task-specific metrics to determine how
smoothing impacts downstream tasks.

Interpolation and Backoff


Interpolation and Backoff are two commonly used techniques for handling data sparsity in N-gram
models. These methods aim to improve language models' ability to generalize and assign probabilities
to unseen word sequences.
1. Interpolation
Definition
 Interpolation combines probabilities from higher-order and lower-order N-grams, rather than relying
solely on the highest available N-gram.

 The idea is to use information from all N-gram levels, weighting them appropriately.

Mathematical Representation
For a trigram model (n = 3):
P(w_i | w_{i-2}, w_{i-1}) = λ3 · P(w_i | w_{i-2}, w_{i-1}) + λ2 · P(w_i | w_{i-1}) + λ1 · P(w_i)
 λ3, λ2, λ1: Interpolation weights (non-negative, sum to 1).
 P(w_i | w_{i-2}, w_{i-1}): Probability based on the trigram.
 P(w_i | w_{i-1}): Probability based on the bigram.
 P(w_i): Probability based on the unigram.
Characteristics
 All levels (unigram, bigram, trigram, etc.) contribute to the final probability.
 Weights (λ) can be determined through techniques like grid search or expectation-maximization (EM) on a held-out dataset.
Advantages
 Smoother probability distribution compared to relying solely on higher-order N-grams.
 Reduces the impact of data sparsity by leveraging lower-order N-grams.
Applications
 Language modeling (e.g., speech recognition, machine translation).
 Predictive text generation.
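
A minimal sketch of linear interpolation for a trigram model, assuming trigram, bigram, and unigram probability functions are already available (the lambda weights below are illustrative, not tuned):

def interpolated_prob(word, prev1, prev2, p_tri, p_bi, p_uni,
                      lambdas=(0.6, 0.3, 0.1)):
    """P(word | prev2, prev1) as a weighted mix of trigram, bigram, and unigram estimates.
    prev1 is the immediately preceding word, prev2 the one before it.
    lambdas must be non-negative and sum to 1 (values here are illustrative)."""
    l3, l2, l1 = lambdas
    return (l3 * p_tri(word, prev1, prev2)
            + l2 * p_bi(word, prev1)
            + l1 * p_uni(word))

# Usage (with hypothetical estimators p_tri, p_bi, p_uni trained elsewhere):
# prob = interpolated_prob("mat", "the", "on", p_tri, p_bi, p_uni)
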
2. Backoff
Definition
 Backoff uses lower-order N-grams only when higher-order N-grams are unavailable or have zero
probability.
 Unlike interpolation, backoff does not combine probabilities; it falls back to lower-order probabilities
as needed.
Mathematical Representation
For a trigram model:
P(w_i | w_{i-2}, w_{i-1}) = P(w_i | w_{i-2}, w_{i-1}) if the trigram exists; otherwise back off to α · P(w_i | w_{i-1})
 α: Backoff weight that ensures proper normalization of probabilities.
Characteristics
 Probabilities from lower-order N-grams are used only when necessary.
 A normalization factor (α) ensures the model’s probabilities sum to 1.
Advantages
 Simpler implementation compared to interpolation.
 Efficient when the corpus contains sufficient higher-order N-grams.
Applications
 Language modeling in applications like text-to-speech (TTS) and auto-completion.
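
A simplified backoff sketch (ignoring the discounting needed for exact normalization, which Katz backoff adds; the probability functions and the α weight are assumptions for illustration):

def backoff_prob(word, prev1, prev2, p_tri, p_bi, p_uni, alpha=0.4):
    """Use the trigram estimate if it is non-zero; otherwise back off to the
    bigram, and finally to the unigram, scaling by alpha at each step.
    (A proper Katz backoff also discounts the higher-order estimates.)"""
    p3 = p_tri(word, prev1, prev2)
    if p3 > 0:
        return p3
    p2 = p_bi(word, prev1)
    if p2 > 0:
        return alpha * p2
    return alpha * alpha * p_uni(word)
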

3. Interpolation vs. Backoff

Feature            | Interpolation                                    | Backoff
Combination        | Combines probabilities from all N-gram levels.   | Uses lower-order N-grams only when necessary.
Weighting          | Requires weights (λ) for each N-gram level.      | Normalizes probabilities with a backoff factor (α).
Sparsity Handling  | Distributes probability across levels smoothly.  | Falls back to lower levels when higher ones fail.
Complexity         | Computationally more complex.                    | Simpler to implement.
Accuracy           | Typically provides better generalization.        | Depends on the size and quality of the corpus.
4. Advanced Techniques
Katz Backoff
 A combination of backoff and smoothing.
 High-order N-grams are used when available, and lower-order N-grams are backed off to with
discounted probabilities.
 Probability adjustment ensures unused mass from higher-order N-grams is redistributed to lower-
order ones.
Linear Interpolation
 A specific form of interpolation where weights are pre-determined or trained on a development
dataset.
 Each N-gram level contributes to the final probability, weighted by fixed or learned factors.

Word Classes
In Natural Language Processing (NLP), word classes (also referred to as parts of speech
(POS) or syntactic categories) are used to group words based on their grammatical roles, syntactic
behavior, and function within a sentence. Word classes help in analyzing, understanding, and
generating human language computationally.

1. Common Word Classes in NLP

Here are the primary word classes typically used in NLP:

Word Class     | Definition                                                                    | Examples
Noun           | Represents people, places, things, or ideas.                                  | dog, city, happiness
Pronoun        | Substitutes for nouns.                                                        | he, she, it, they
Verb           | Denotes actions, states, or events.                                           | run, is, think
Adjective      | Describes or modifies nouns.                                                  | happy, red, tall
Adverb         | Modifies verbs, adjectives, or other adverbs.                                 | quickly, very, tomorrow
Preposition    | Shows relationships between a noun/pronoun and another word in the sentence.  | in, on, by, with
Conjunction    | Connects words, phrases, or clauses.                                          | and, but, or
Determiner     | Modifies nouns to clarify reference.                                          | the, a, some, this
Interjection   | Expresses emotion or exclamation.                                             | oh, wow, ouch
Numeral        | Represents numbers or quantities.                                             | one, two, third
Particle       | Adds meaning or emphasis, often functioning as part of a phrasal verb.        | not, up, off
Auxiliary Verb | Helps the main verb express tense, mood, or voice.                            | is, have, can
2. Importance of Word Classes in NLP
 Syntactic Analysis: Identifying word classes is essential for parsing sentences into meaningful
structures.
 Semantic Understanding: Helps in understanding the meaning and relationships of words in a
sentence.
 Applications:
o Machine Translation: Determines correct grammar and structure in the target language.
o Text-to-Speech (TTS): Improves pronunciation and prosody.
o Question Answering: Helps identify entities and relationships in text.
3. Word Class Tagging
Part-of-Speech (POS) Tagging
 POS tagging involves assigning word classes to each word in a text.
 Example: Sentence: She is reading a book. POS Tags: She (PRON), is (AUX), reading (VERB), a
(DET), book (NOUN)
POS Tagging Tools
1. NLTK (Natural Language Toolkit):
o Uses pre-trained POS taggers based on the Penn Treebank tag set.
2. SpaCy:
o Provides efficient, pre-trained models for POS tagging.
3. Stanford NLP:
o A highly accurate POS tagging library.
Popular Tag Sets
1. Penn Treebank POS Tag Set: Common in English NLP tasks.
o Example tags: NN (Noun, singular), VB (Verb, base form), DT (Determiner).
2. Universal POS Tag Set: A cross-linguistic standard.
o Example tags: NOUN, VERB, ADJ.
4. Challenges in Word Classes for NLP
1. Ambiguity:
o Some words can belong to multiple classes depending on context.
o Example: book (Noun: a book) vs. book (Verb: to book a ticket).
2. Idiomatic Expressions:
o Words may lose their standard class roles in idioms.
o Example: kick the bucket (phrase meaning to die).
3. Morphologically Rich Languages:
o Languages like Finnish or Turkish have complex inflection systems, making word class tagging
challenging.
4. Domain-Specific Vocabulary:
o Technical or slang words may not fit neatly into standard word classes.
5. Applications of Word Classes in NLP
1. Machine Translation:
o Ensures grammatical correctness in translations by tagging words with their appropriate classes.
2. Named Entity Recognition (NER):
o Distinguishes between nouns (e.g., dog) and proper nouns (e.g., Google).

3. Text Summarization:
o Identifies keywords and key phrases based on nouns, verbs, and adjectives.
4. Sentiment Analysis:
o Adjectives and adverbs often carry sentiment, aiding polarity detection.
5. Information Retrieval:
o Helps identify relevant words or phrases in search queries.

Part-of-Speech Tagging
Part-of-Speech Tagging is the process of assigning word classes or grammatical categories
(e.g., noun, verb, adjective) to each word in a given text based on its context. It is a fundamental step
in many NLP applications as it helps in understanding the syntactic structure and meaning of a
sentence.
1. Why POS Tagging is Important
1. Syntactic Analysis:
o Identifies the grammatical structure of sentences for parsing and sentence analysis.
2. Semantic Understanding:
o Determines word meaning based on context (e.g., "book" as a noun or verb).
3. Downstream Applications:
o Named Entity Recognition (NER): Identifies proper nouns.
o Machine Translation: Ensures grammatically correct output.
o Text Summarization: Extracts key phrases based on POS.
o Sentiment Analysis: Leverages adjectives and adverbs to detect sentiment.
2. How POS Tagging Works
Steps in POS Tagging:
1. Tokenization:
o Split the input text into individual words or tokens.
o Example: "The cat sat on the mat." → ["The", "cat", "sat", "on", "the", "mat"]
2. Assigning POS Tags:
o Each token is assigned a tag based on:
 Rule-Based Methods: Grammar rules.
 Statistical Models: Probabilities derived from training data.
 Deep Learning Models: Neural networks that learn contextual relationships.
POS Tags
 Commonly used POS tagging schemes:
1. Penn Treebank POS Tag Set (for English):
 NN: Noun (singular)
 VB: Verb (base form)
 JJ: Adjective
 RB: Adverb
 IN: Preposition
2. Universal POS Tag Set:
 NOUN, VERB, ADJ, ADV, PRON, etc.
3. Techniques for POS Tagging
1. Rule-Based Tagging
 Relies on manually defined grammar rules.
 Example:

o "If a word ends with '-ing', tag it as a verb (VB)."


 Limitation:
o Cannot handle complex or ambiguous contexts effectively.
2. Statistical Tagging
 Uses probabilistic models trained on labeled data.
 Examples:
o Hidden Markov Models (HMMs):
 Computes the most likely sequence of tags based on transition and emission probabilities.
o Conditional Random Fields (CRFs):
 Incorporates contextual features and learns the sequence of tags.
3. Neural Network-Based Tagging
 Uses deep learning to capture context and dependencies in sentences.
 Examples:
o Recurrent Neural Networks (RNNs):
 Learn sequential data, such as sentences, for POS tagging.
o Bidirectional LSTMs (Bi-LSTMs):
 Use context from both directions (previous and next words) for better tagging accuracy.
o Transformers:
 Models like BERT pre-trained on large corpora excel at contextual tagging.
4. Challenges in POS Tagging
1. Ambiguity:
o Words can have multiple POS tags depending on context.
o Example: book (Noun: "a book", Verb: "to book a room").
2. Out-of-Vocabulary Words:
o Words not seen during training can be challenging to tag.
3. Complex Sentence Structures:
o Long or syntactically ambiguous sentences may reduce accuracy.
4. Domain-Specific Text:
o Jargon or technical terms require domain-specific models.
5. Applications of POS Tagging
1. Named Entity Recognition (NER):
o Identifies entities like names, locations, or dates based on POS tags.
2. Dependency Parsing:
o Establishes syntactic relationships between words.
3. Text Summarization:
o Extracts key phrases by focusing on nouns, verbs, and adjectives.
4. Machine Translation:
o Ensures grammatical correctness in translations.
5. Question Answering Systems:
o Identifies the role of words in questions (e.g., subject, object).
6. POS Tagging Libraries
1. NLTK (Natural Language Toolkit):
o A Python library with pre-trained POS taggers.
import nltk
from nltk import pos_tag, word_tokenize

sentence = "The quick brown fox jumps over the lazy dog."
tokens = word_tokenize(sentence)
pos_tags = pos_tag(tokens)
print(pos_tags)
2. SpaCy:
o Provides fast and efficient POS tagging.
import spacy

nlp = spacy.load("en_core_web_sm")
sentence = "The quick brown fox jumps over the lazy dog."
doc = nlp(sentence)
for token in doc:
    print(f"{token.text}: {token.pos_}")
3. Stanford CoreNLP:
o A highly accurate library for POS tagging, using statistical models.
4. Flair:
o A deep learning library specialized in POS tagging and other NLP tasks.
5. BERT-based Models:
o Pre-trained transformer models like BERT can perform POS tagging with fine-tuning.
7. Example of POS Tagging in Python
Here’s a simple implementation using NLTK:
import nltk
from nltk import pos_tag, word_tokenize

# Download NLTK resources
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

# Example sentence
sentence = "The cat sat on the mat."

# Tokenize and perform POS tagging
tokens = word_tokenize(sentence)
pos_tags = pos_tag(tokens)

# Print the results
print("Word\tPOS Tag")
for word, tag in pos_tags:
    print(f"{word}\t{tag}")
Output:
Word POS Tag


The DT
cat NN
sat VBD
on IN
the DT
mat NN
. .

Rule-based

Rule-Based NLP involves the use of hand-crafted linguistic rules and patterns to process,
analyze, and generate human language. It is one of the oldest approaches in NLP and relies on
predefined rules, lexicons, and grammar to achieve language understanding or generation.
1. What is Rule-Based NLP?
In a rule-based system, language processing is based on a set of manually defined rules. These rules
are created by linguists or domain experts and are used to identify patterns in text or to define how
language elements interact.
 Example:
o Rule: If a word ends with "-ing," it is likely a verb.
o Rule: If "not" appears before an adjective, classify it as negative sentiment.
Key Components:
1. Lexicons: Word lists or dictionaries with associated features (e.g., part of speech, polarity).
2. Grammar Rules: Syntax and morphology rules (e.g., subject-verb agreement, noun phrase structure).
3. Pattern Matching: Matching text to specific patterns (e.g., regular expressions).
4. Rule Engine: A system that applies rules to text.
2. Applications of Rule-Based NLP
1. Part-of-Speech (POS) Tagging
 Rule-based taggers assign POS tags to words using linguistic rules.
 Example Rule:
o If the preceding word is a determiner (e.g., "the"), tag the current word as a noun.
2. Named Entity Recognition (NER)
 Detect entities like names, dates, and locations using patterns.
 Example Rule:
o If a word starts with a capital letter and is followed by "Inc." or "Ltd.," classify it as an organization.
3. Sentiment Analysis
 Identify positive or negative sentiment using sentiment word lexicons and negation rules.
 Example Rule:
o If "not" appears before a positive word (e.g., "not good"), classify it as negative.
4. Text Normalization
 Handle text preprocessing tasks like stemming and lemmatization using rules.
 Example Rule:
o If a word ends in "ing," remove "ing" (e.g., "running" → "run").
5. Spell Checking
 Correct spelling errors by comparing against a dictionary and applying transformation rules.
6. Information Extraction
 Extract structured information from unstructured text using templates and rules.
 Example:
o Extract dates in the format "DD-MM-YYYY" using regex patterns.
7. Question Answering
 Use rules to detect question types and retrieve relevant information.
 Example Rule:
o If a question starts with "Who," retrieve entities tagged as "Person."
3. Advantages of Rule-Based NLP
1. Interpretability:
o Rules are explicit and easy to understand.
o Useful in domains where decisions need to be explainable.
2. Domain Adaptability:
o Rules can be customized for specific languages, industries, or tasks.
3. Low Data Dependency:
o Does not require large labeled datasets for training.
4. Deterministic Behavior:
o Outputs are predictable and consistent.
4. Limitations of Rule-Based NLP
1. Scalability:
o Creating and maintaining a large number of rules is time-consuming and labor-intensive.
2. Coverage:
o Rules may fail to handle edge cases, ambiguities, or new language patterns.
3. Context Sensitivity:
o Difficult to account for context or nuances of natural language effectively.
4. Maintenance:
o Rules need to be updated frequently to keep up with evolving language and domain-specific terms.
5. Generalization:
o Rule-based systems often struggle with unseen data or out-of-vocabulary words.
5. Examples of Rule-Based NLP Techniques
1. Regular Expressions (Regex)
 Used for pattern matching in text.
 Example:
o Extract email addresses:

import re
text = "Contact us at [email protected]."
emails = re.findall(r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Z|a-z]{2,}\b', text)
print(emails)
Output: ['[email protected]']
2. Rule-Based POS Tagging
 Example using NLTK:
import nltk
from nltk import RegexpTagger
# Define POS tagging rules


rules = [
(r'.*ing$', 'VBG'), # Gerunds
(r'.*ed$', 'VBD'), # Past tense verbs
(r'.*es$', 'VBZ'), # 3rd person singular verbs
(r'.*ly$', 'RB'), # Adverbs
(r'.*', 'NN') # Default: Noun
]

# Apply rule-based tagger


tagger = RegexpTagger(rules)
sentence = "The cat is running quickly."
tokens = nltk.word_tokenize(sentence)
tags = tagger.tag(tokens)
print(tags)
Output:
[('The', 'NN'), ('cat', 'NN'), ('is', 'NN'), ('running', 'VBG'), ('quickly', 'RB'), ('.', 'NN')]
3. Named Entity Recognition
 Extract dates using regex:
import re
text = "The meeting is scheduled for 14-01-2025."
dates = re.findall(r'\b\d{2}-\d{2}-\d{4}\b', text)
print(dates)
Output: ['14-01-2025']

6. Rule-Based vs. Statistical/ML-Based NLP


Aspect | Rule-Based NLP | Statistical/ML-Based NLP
Interpretability | High (rules are explicit). | Low (complex models like neural networks).
Data Dependency | Low (works without large datasets). | High (requires labeled data for training).
Flexibility | Limited (hard to adapt to new patterns). | High (generalizes better with enough data).
Scalability | Difficult to scale as rules increase. | Scales well with more data and computational power.
Performance | Good for small, well-defined tasks. | Better for large, complex, and ambiguous tasks.
7. Hybrid Systems
Modern NLP systems often combine rule-based and ML-based approaches to leverage the strengths
of both:
 Example: Use ML models for initial tagging and apply rule-based post-processing for domain-
specific corrections.
8. Use Cases for Rule-Based NLP
 Domains with Low Data Availability:
o Legal or healthcare text analysis where labeled datasets are limited.
 Critical Applications:
o Applications requiring high interpretability (e.g., financial compliance).
 Text Preprocessing:
o Tokenization, normalization, and filtering in NLP pipelines.

Stochastic and Transformation-based tagging


POS tagging is a crucial task in NLP, and there are multiple approaches to achieving it, including
stochastic methods and transformation-based learning (TBL). Here's a breakdown of these two
approaches:
1. Stochastic Tagging
Stochastic Tagging relies on probability and statistics to assign part-of-speech (POS) tags to words.
It involves using statistical models trained on labeled data to compute the most likely sequence of tags
for a sentence.
Key Techniques in Stochastic Tagging
1. Hidden Markov Models (HMM)
 Overview:
o Assumes a sequence of words is generated by a hidden sequence of states (POS tags).
o Uses transition probabilities (from one tag to another) and emission probabilities (probability of a
word given a tag) to find the most likely sequence of tags.
 Key Steps:
o Calculate probabilities from a tagged corpus.
o Use the Viterbi Algorithm to find the most probable sequence of tags.
 Example:
o Given the sentence "The cat sleeps":
 Transition Probability: P(NN → VB) = 0.2 (probability of a noun followed by a verb).
 Emission Probability: P(sleeps | VB) = 0.8 (probability of the word "sleeps" given the tag VB).
2. N-gram Models
 Uses n-grams (sequences of n words or tags) to compute probabilities.
 For POS tagging, bigrams and trigrams are commonly used.
 Example:
o If the bigram probability P(DT → NN) is high, the word "dog" is likely to be a noun when preceded
by a determiner like "the."
3. Maximum Entropy Models
 Probabilistic models that consider a wider range of contextual features.
 Predict the POS tag that maximizes the conditional probability given the features.
4. Conditional Random Fields (CRFs)


 Discriminative models that predict the sequence of tags directly, using both current and surrounding
context.
 Example:
o Predicts tags for the sentence "The quick brown fox" by considering the relationship between
neighboring words and their tags.
Advantages of Stochastic Tagging
1. Can handle ambiguity using probabilities.
2. Generalizes well to unseen data with sufficient training.
3. Scalable for large datasets.
Challenges of Stochastic Tagging
1. Requires a large, annotated corpus for training.
2. Struggles with domain-specific text or out-of-vocabulary words.
3. Complex models like CRFs and HMMs may be computationally expensive.
2. Transformation-Based Tagging (TBL)
Transformation-Based Tagging, also known as Brill Tagging, is a hybrid approach that combines
rule-based and stochastic methods. It learns rules from data iteratively to correct initial tagging errors.
How TBL Works
1. Initialization:
o Start with a baseline tagger (e.g., assign the most frequent tag for each word).
o Example: Tag "book" as NN (noun) because it's most commonly a noun.
2. Rule Generation:
o Identify contexts where the initial tag is incorrect and generate rules to correct these errors.
o Example Rule:
 If the previous word is "to" and the current word is "book," change the tag from NN (noun) to VB
(verb).
3. Rule Application:
o Apply the learned rules iteratively, refining the tagging process in each iteration.
4. Stopping Condition:
o Stop when no further improvements are made or when a predefined number of iterations is reached.
Example of TBL
Input Sentence:
 "I want to book a flight."
Initial Tags (Baseline):
 I/PRP want/VB to/TO book/NN a/DT flight/NN
Transformation Rule:
 Rule: If the previous word is "to" and the current word is tagged as NN, change the tag to VB.
Final Tags:
 I/PRP want/VB to/TO book/VB a/DT flight/NN
Advantages of Transformation-Based Tagging
1. Interpretable Rules:
o Rules are human-readable and explainable.
2. Domain Adaptability:
o Rules can be adapted for specific domains or tasks.
3. Efficiency:
o Does not require probabilistic computations during inference.
Challenges of TBL
1. Dependency on Baseline:
o The quality of the baseline tagger impacts performance.
2. Rule Creation:
o Iterative rule learning can be slow for large datasets.
3. Error Propagation:
o Errors in early iterations may propagate to later stages.
3. Comparison of Stochastic and TBL
Aspect | Stochastic Tagging | Transformation-Based Tagging (TBL)
Methodology | Relies on probabilities and statistical models. | Learns explicit transformation rules iteratively.
Training Data | Requires large labeled datasets. | Requires labeled data but learns interpretable rules.
Interpretability | Low (statistical models are complex). | High (rules are human-readable).
Adaptability | Generalizes well with sufficient data. | Can be fine-tuned for specific domains.
Speed | Faster at runtime (once trained). | Slower due to iterative rule application.
Error Handling | Handles ambiguity using probabilities. | Iteratively corrects errors with learned rules.
4. Applications of Stochastic and TBL Tagging
1. Part-of-Speech Tagging:
o Stochastic models (e.g., HMMs, CRFs) are widely used in general-purpose POS taggers.
o TBL is useful in domains where interpretability is crucial.
2. Named Entity Recognition (NER):
o Stochastic models identify entities using probabilistic tagging.
o TBL can refine entity tags by applying domain-specific rules.
3. Spell Checking and Correction:
o TBL can be used to create rules for correcting common spelling errors.
4. Syntactic Parsing:
o Stochastic models like CRFs assist in dependency and constituency parsing.

5. Example: Python Implementation


Stochastic Tagging with NLTK
import nltk
from nltk import pos_tag, word_tokenize
# Example sentence
sentence = "I want to book a flight."

# Tokenize and apply POS tagging


tokens = word_tokenize(sentence)
tags = pos_tag(tokens)

print("Stochastic POS Tagging:")


print(tags)
Output:
[('I', 'PRP'), ('want', 'VB'), ('to', 'TO'), ('book', 'VB'), ('a', 'DT'), ('flight', 'NN')]
Transformation-Based Tagging with NLTK
from nltk.tag import DefaultTagger
from nltk.tag.brill import brill24
from nltk.tag.brill_trainer import BrillTaggerTrainer

# Sample tagged corpus
training_data = [
    [("I", "PRP"), ("want", "VB"), ("to", "TO"), ("book", "VB"), ("a", "DT"), ("flight", "NN")],
    [("The", "DT"), ("dog", "NN"), ("barks", "VBZ")]
]

# Train a Brill tagger on top of a simple baseline tagger
default_tagger = DefaultTagger('NN')  # Baseline: tag everything as NN
trainer = BrillTaggerTrainer(default_tagger, brill24())
brill_tagger = trainer.train(training_data)

# Test the tagger on a plain list of tokens
test_sentence = ["I", "want", "to", "book", "a", "flight"]
print("Transformation-Based POS Tagging:")
print(brill_tagger.tag(test_sentence))

Issues in POS Tagging


POS tagging, the process of assigning grammatical tags to words in a text, is a fundamental
task in Natural Language Processing (NLP). Despite its importance, several challenges arise during its
implementation due to the complexity and ambiguity of human language.
1. Ambiguity
1.1 Lexical Ambiguity
 A single word can have multiple possible POS tags depending on its context.
o Example:
 "Book a flight" → "book" (Verb)
 "Read the book" → "book" (Noun)
1.2 Structural Ambiguity
 The structure of a sentence can lead to multiple valid tag sequences.
o Example:
 "Visiting relatives can be fun"
 "Visiting" as a Verb (action) or Adjective (modifier).
1.3 Tagging Ambiguity
 Words that can fit into multiple categories even within similar contexts.
o Example:
 "He saw her duck"
 "duck" could be a Noun (animal) or Verb (action).
2. Out-of-Vocabulary (OOV) Words
 Words not present in the training data can lead to incorrect or undefined tags.
o Common cases:
 Neologisms: Newly coined words (e.g., "selfie").
 Technical Terms: Domain-specific jargon.
 Foreign Words: Words borrowed from other languages.
3. Domain-Specific Challenges
 POS tagging systems trained on general-purpose corpora may fail in specific domains.
o Example:
 Medical domain: "MRI scan" (Medical jargon)
 Legal domain: "Hereby declare" (Formal language)
4. Language Variability
4.1 Morphological Richness
 Some languages (e.g., Finnish, Turkish) have rich morphology where a single word can represent an
entire phrase in English.
o Example:
 Turkish: "Evlerinizden" → "From your houses."
4.2 Free Word Order
 In free word order languages (e.g., Sanskrit, Hungarian), the sequence of words does not always
determine their grammatical role, making tagging more complex.
4.3 Lack of Resources
 For low-resource languages, there may be insufficient annotated corpora for training.
5. Multiword Expressions (MWEs)
 Phrases that function as a single unit can cause confusion in tagging.
o Example:
 "New York" should be tagged as a proper noun (NNP), not as two separate entities.

6. Context Sensitivity
 POS tags often depend on the broader sentence or paragraph context, which simple models may fail to
capture.
o Example:
 "He likes to fish"
 "The fish is fresh"
7. Inconsistent Annotation Standards


 Different corpora use different POS tagsets or annotation guidelines, leading to inconsistency in
tagging models.
o Example:
 Universal POS Tagset (simpler): "book" → VERB
 Penn Treebank Tagset (granular): "book" → VB
8. Polysemy and Homonymy
 Polysemy: Words with multiple related meanings.
o Example: "run" → a physical action (VB) or a race (NN).
 Homonymy: Words with unrelated meanings but identical spelling.
o Example: "bank" → a financial institution (NN) or the side of a river (NN).
9. Noisy Text
 Tagging becomes difficult in non-standard or informal text formats, such as:
o Social Media Text: Contains abbreviations, emojis, and slang.
 Example: "u r gr8" → "you are great."
o Speech Transcriptions: May include disfluencies and fillers.
 Example: "Um, I think I like, uh, coffee."
10. Compound Words
 Words like "ice-cream" or "well-being" can be misinterpreted as separate tokens or misclassified.
11. Handling Code-Switching
 In multilingual contexts, speakers often switch between languages mid-sentence.
o Example:
 "I need to book a taxi जल्दी से" (English + Hindi; "jaldi se" means "quickly").
12. Evaluation and Metrics
 Evaluating POS taggers is challenging due to:
o Different annotation schemes.
o Disagreement between annotators in ambiguous cases.
o Metric limitations: Precision, recall, and F1 may not always reflect real-world performance.
13. Dependency on Training Data
 Quality of Training Data:
o Poorly annotated corpora result in models learning incorrect patterns.
 Bias in Data:
o Models trained on biased datasets may perform poorly in diverse contexts.
14. Memory and Computational Constraints
 Resource-heavy models like CRFs or neural networks may not work well on devices with limited
computational power.

Strategies to Overcome Challenges


1. Ambiguity Handling
 Use context-aware models like Transformers (e.g., BERT) to capture sentence context.
 Leverage multi-task learning to integrate syntactic and semantic information.
2. OOV Word Management
 Use subword tokenization (e.g., Byte Pair Encoding or WordPiece).
 Add fallback rules for rare or unseen words.


3. Domain Adaptation
 Train or fine-tune models on domain-specific corpora.
 Use transfer learning techniques.
4. Handling Noisy Text
 Preprocess text to normalize slang, abbreviations, and spelling errors.
 Use domain-specific embeddings trained on noisy text (e.g., Twitter embeddings).
5. Multilingual Solutions
 Use models trained on Universal Dependencies (UD) to standardize tagging across languages.
 Build language-agnostic embeddings (e.g., mBERT, XLM-R).
6. Resource Scarcity
 Use cross-lingual transfer learning or unsupervised methods for low-resource languages.
 Employ crowdsourcing to create labeled datasets.
7. Incorporating Linguistic Knowledge
 Add linguistic rules or constraints to supplement statistical or neural models.
 Combine rule-based and data-driven approaches for better performance.

Hidden Markov Models (HMM) and Maximum Entropy Models


Hidden Markov Models (HMM) and Maximum Entropy Models (MaxEnt) are two widely
used statistical techniques in Natural Language Processing (NLP). They are often applied to sequence
labeling tasks, such as Part-of-Speech (POS) tagging, Named Entity Recognition (NER), and other
text classification problems.
1. Hidden Markov Models (HMM)
An HMM is a probabilistic model used for modeling sequences, where the system being
modeled is assumed to follow a Markov process with hidden states.
Key Concepts in HMM
1. States:
o Represent hidden variables, e.g., POS tags.
o Example: {NN, VB, DT} (noun, verb, determiner).
2. Observations:
o Represent observable data, e.g., words in a sentence.
o Example: ["The", "cat", "jumps"].
3. Transition Probabilities (P(tag_i | tag_(i-1))):
o Probability of transitioning from one state to another.
o Example: P(VB → NN).
4. Emission Probabilities (P(word | tag)):
o Probability of observing a word given a state.
o Example: P("cat" | NN).

5. Initial Probabilities (P(tag)):


o Probability of starting in a specific state.
o Example: P(DT) = 0.3.
How HMM Works
 HMM assumes that:
1. The current state depends only on the previous state (Markov property).
2. The observed word depends only on the current state.


Decoding with HMM
 The goal is to find the most likely sequence of states (tags) given the observed sequence of words.
 Viterbi Algorithm:
o A dynamic programming algorithm to compute the most probable tag sequence efficiently.
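A minimal sketch of Viterbi decoding in plain Python is given below. The tagset and all transition, emission, and start probabilities are assumed toy values chosen only for illustration, not estimates from a real corpus.
import math

# Toy HMM: tagset {DT, NN, VB} with assumed probabilities
states = ["DT", "NN", "VB"]
start_p = {"DT": 0.6, "NN": 0.3, "VB": 0.1}
trans_p = {"DT": {"DT": 0.05, "NN": 0.90, "VB": 0.05},
           "NN": {"DT": 0.10, "NN": 0.30, "VB": 0.60},
           "VB": {"DT": 0.40, "NN": 0.40, "VB": 0.20}}
emit_p = {"DT": {"the": 0.7},
          "NN": {"cat": 0.6, "sleeps": 0.1},
          "VB": {"sleeps": 0.5}}

def logp(p):
    # Log-probability with a tiny floor so unseen events do not give log(0)
    return math.log(p if p > 0 else 1e-12)

def viterbi(words):
    # best[i][s] = best log-score of a tag sequence for words[:i+1] ending in tag s
    best = [{s: logp(start_p[s]) + logp(emit_p[s].get(words[0], 0)) for s in states}]
    back = [{}]
    for i in range(1, len(words)):
        best.append({})
        back.append({})
        for s in states:
            prev, score = max(
                ((p, best[i - 1][p] + logp(trans_p[p][s]) + logp(emit_p[s].get(words[i], 0)))
                 for p in states),
                key=lambda x: x[1])
            best[i][s], back[i][s] = score, prev
    # Trace back the highest-scoring path
    tag = max(best[-1], key=best[-1].get)
    path = [tag]
    for i in range(len(words) - 1, 0, -1):
        path.insert(0, back[i][path[0]])
    return path

print(viterbi(["the", "cat", "sleeps"]))   # expected: ['DT', 'NN', 'VB']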
Advantages of HMM
 Simplicity: Easy to implement and interpret.
 Probabilistic Framework: Provides a natural way to handle uncertainty in language.
Disadvantages of HMM
1. Strong Independence Assumptions:
o Assumes that the current state depends only on the previous state and the current word.
2. Data Sparsity:
o Struggles with unseen words or rare transitions.
3. Fixed Features:
o Cannot incorporate rich contextual features easily.
2. Maximum Entropy Models (MaxEnt)
Maximum Entropy Models, also known as log-linear models, are discriminative models used
for classification tasks. MaxEnt models predict the conditional probability of a class (e.g., a tag) given
an input feature vector.
Key Concepts in MaxEnt
1. Feature Representation:
o Captures contextual information, e.g., surrounding words, word suffixes, and capitalization.
o Example Features:
 Current Word: "cat"
 Previous Word: "The"
 Is Capitalized: False
2. Conditional Probability:
o Computes the probability of a class (tag) given the features.
o Formula: P(y∣x) = (1 / Z(x)) exp(∑_i λ_i f_i(x, y))
 f_i(x, y): Feature function.
 λ_i: Weight of the feature.
 Z(x): Normalization factor.
3. Training:
o Maximize the log-likelihood of the training data using optimization algorithms (e.g., gradient
descent).
Advantages of MaxEnt
1. Rich Features:
o Can incorporate arbitrary, overlapping, and non-independent features.
2. Flexibility:
o No need for independence assumptions (unlike HMM).
3. Discriminative:
o Directly models the conditional probability P(y∣x)P(y | x)P(y∣x).
Disadvantages of MaxEnt
1. Computational Cost:
o Training can be expensive, especially with many features.
2. Overfitting:
o Requires regularization to avoid overfitting on the training data.


3. Data Dependence:
o Performance depends heavily on feature engineering and quality of labeled data.
3. Comparison of HMM and MaxEnt
Aspect | HMM | MaxEnt
Model Type | Generative: models the joint probability P(x, y). | Discriminative: models the conditional probability P(y∣x).
Independence Assumptions | Strong independence assumptions. | No independence assumptions.
Features | Limited to emission and transition probabilities. | Can use rich, overlapping contextual features.
Efficiency | Faster to train and decode. | Computationally intensive during training.
Use Cases | Sequence labeling with simple features. | Sequence labeling with complex features.
Robustness | Struggles with sparse data. | Handles sparse data better with proper regularization.
4. Applications of HMM and MaxEnt in NLP
HMM Applications
1. Part-of-Speech Tagging:
o Assign POS tags to words in a sentence.
2. Speech Recognition:
o Model sequences of phonemes or words in audio signals.
3. Machine Translation:
o Generate translation probabilities for word alignments.
MaxEnt Applications
1. Named Entity Recognition (NER):
o Identify entities like names, locations, and organizations.
2. Chunking:
o Identify phrases (e.g., noun or verb phrases) in sentences.
3. Sentiment Analysis:
o Classify text as positive, negative, or neutral.

5. Example of HMM and MaxEnt


HMM Example (POS Tagging)
import nltk
from nltk.tag import hmm
# Training data
training_data = [
[("The", "DT"), ("dog", "NN"), ("barks", "VBZ")],
[("A", "DT"), ("cat", "NN"), ("sleeps", "VBZ")]
]

# Train HMM
trainer = hmm.HiddenMarkovModelTrainer()
hmm_tagger = trainer.train(training_data)

# Test the HMM


sentence = ["The", "cat", "barks"]
tags = hmm_tagger.tag(sentence)
print("HMM POS Tags:", tags)
MaxEnt Example (POS Tagging with NLTK)
from nltk.classify import MaxentClassifier

# Feature extractor
def pos_features(sentence, i):
    word = sentence[i]
    features = {
        'word': word,
        'is_first': i == 0,
        'is_last': i == len(sentence) - 1,
        'is_capitalized': word[0].isupper(),
        'prev_word': '' if i == 0 else sentence[i - 1],
        'next_word': '' if i == len(sentence) - 1 else sentence[i + 1]
    }
    return features

# Training data
training_data = [
(["The", "dog", "barks"], ["DT", "NN", "VBZ"]),
(["A", "cat", "sleeps"], ["DT", "NN", "VBZ"])
]

# Prepare features
train_features = [
({**pos_features(sentence, i)}, tag)
for sentence, tags in training_data
for i, tag in enumerate(tags)
]
# Train MaxEnt Classifier


classifier = MaxentClassifier.train(train_features, max_iter=10)

# Test the classifier


sentence = ["The", "cat", "sleeps"]
test_features = [pos_features(sentence, i) for i in range(len(sentence))]
tags = [classifier.classify(features) for features in test_features]
print("MaxEnt POS Tags:", list(zip(sentence, tags)))

End Of Unit-II

UNIT III

SYNTACTIC ANALYSIS: Context-Free Grammars, Grammar rules for English, Treebanks,


Normal Forms for grammar – Dependency Grammar – Syntactic Parsing, Ambiguity, Dynamic
Programming parsing – Shallow parsing – Probabilistic CFG, Probabilistic CYK, Probabilistic
Lexicalized CFGs - Feature structures, Unification of feature structures.

Syntactic Analysis:
Syntactic analysis, also called parsing or syntax analysis, is the third phase of NLP. The purpose of this phase is to draw the exact meaning, or dictionary meaning, from the text. Syntax analysis checks the text for meaningfulness by comparing it with the rules of formal grammar. For example, a phrase like "hot ice-cream" would be rejected by a semantic analyzer.

Syntactic analysis or parsing may be defined as the process of analyzing strings of symbols in natural language according to the rules of a formal grammar. The word 'parsing' originates from the Latin word 'pars', which means 'part'.

Types of Parsing:

1. Top-down Parsing
2. Bottom-up Parsing

Top-down Parsing:

In this kind of parsing, the parser starts constructing the parse tree from the start symbol and then tries
to transform the start symbol to the input.

Bottom-up Parsing:
In this kind of parsing, the parser starts with the input symbol and tries to construct the parser tree up
to the start symbol.

Context Free Grammar:

A context-free grammar (CFG) is a formal system used to describe a class of languages known
as context-free languages (CFLs). The purpose of a context-free grammar is:
 To list all strings in a language using a set of rules (production rules).
 It extends the capabilities of regular expressions and finite automata

Production rules have the form A → (V ∪ T)*, where A ∈ V.


V (Variables/Non-terminals):

These are symbols that can be replaced using production rules. They help in defining the structure
of the grammar. Typically, non-terminals are represented by uppercase letters (e.g., S, A, B).
T (Terminals):

These are symbols that appear in the final strings of the language and cannot be replaced further.
They are usually represented by lowercase letters (e.g., a, b, c) or specific symbols.

What is Grammar?

Grammar is defined as the rules for forming well-structured sentences. Grammar also plays an
essential role in describing the syntactic structure of well-formed programs, like denoting the
syntactical rules used for conversation in natural languages.

In the theory of formal languages, grammar is also applicable in Computer Science, mainly in
programming languages and data structures. Example - In the C programming language, the precise
grammar rules state how functions are made with the help of lists and statements.

Mathematically, a grammar G can be written as a 4-tuple (N, T, S, P) where:

 N or VN = set of non-terminal symbols or variables.


 T or ∑ = set of terminal symbols.
 S = Start symbol where S ∈ N
 P = Production rules for Terminals as well as Non-terminals.

It has the form α → β, where α and β are strings over VN ∪ ∑, and at least one symbol of α belongs to VN.

NORMAL FORM FOR GRAMMAR


In Natural Language Processing (NLP), the most common "normal form" for grammar is Chomsky
Normal Form (CNF), where all production rules in a context-free grammar are either of the form "A
→ BC" (where A, B, and C are non-terminal symbols) or "A → a" (where A is a non-terminal and a is
a terminal symbol).
Key points about Chomsky Normal Form:
Structure:
Each rule in CNF either generates two non-terminal symbols or a single terminal symbol.
Benefits:
Converting a grammar to CNF allows for efficient parsing algorithms by ensuring a consistent tree
structure with a branching factor of 2.
Other normal forms: While less commonly used in NLP, another important normal form is Greibach Normal Form, where rules are of the form "A → aβ" (where a is a terminal and β is a string of non-terminals).
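Worked example (a sketch of CNF conversion): the rule S → NP VP PP is not in CNF because its right-hand side has three symbols. It can be converted by introducing a helper non-terminal X1 (used here only for illustration): S → NP X1 and X1 → VP PP. Similarly, a mixed rule such as VP → 'eats' NP is rewritten as VP → X2 NP together with the lexical rule X2 → 'eats', where X2 is another helper symbol.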

DEPENDENCY GRAMMAR:
Dependency grammar is a fundamental concept in natural language processing (NLP) that allows us
to understand how words connect within sentences. It provides a framework for representing sentence
structure based on word-to-word relationships.

SYNTACTIC PARSING
Syntactic parsing is a natural language processing (NLP) technique that analyzes the grammatical
structure of sentences. It's also known as syntactic analysis or parsing.
How it works ?
Syntactic parsing breaks down sentences into their grammatical parts, such as nouns, verbs, and
adjectives
 It uses formal grammar rules to assign a semantic structure to the text.
 It helps machines understand the structure and meaning of human language.
 Syntactic parsing helps determine the relationships between words, phrases, and classes.
 It's a foundation for more advanced linguistic analyses, such as sentiment analysis.
 It helps machines understand whether a sentence has a logical meaning or if its grammatical structure
is correct.
Examples:
 In the sentence "The dog went away," the subject is "the dog" and the predicate is "went away".
In the sentence "The boy ate the pancakes from the jumping table," "the boy" is a noun
phrase, "ate" is a verb, and "jumping table" is a verb phrase

In NLP, the main types of parsing used in syntactic parsing are


 constituency parsing
 dependency parsing

Constituency Parsing:
Breaks a sentence down into hierarchical phrase structures like noun phrases, verb phrases, and
clauses.
Represents the grammatical relationships between words using a tree structure where each node
represents a phrase.
Example: "The big red ball" would be parsed as a noun phrase with "the big red ball" as the head.
Dependency Parsing:
 Identifies the head word of each phrase and shows the direct dependencies between words within a
sentence.
 Represents relationships using labeled edges connecting dependent words to their head words.
Example: "The big red ball" would show "ball" as the head with "the", "big", and "red" depending on
it.
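A minimal sketch of dependency parsing with spaCy (assuming the small English model has been installed with: python -m spacy download en_core_web_sm):
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("The big red ball bounced.")
for token in doc:
    # token.dep_ is the dependency label; token.head is the word this token depends on
    print(f"{token.text:<10} {token.dep_:<10} head: {token.head.text}")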
AMBIGUITY
Ambiguity in Natural Language Processing (NLP) occurs when a word, phrase, or sentence has more
than one meaning
Types of ambiguity
 Lexical ambiguity
When a word has more than one meaning. For example, the word "ready" can have multiple
meanings.
 Syntactic ambiguity
When the function of a word in a sentence is unclear. For example, in the sentence "They are visiting
relatives", it's not clear if "visiting" is a verb or an adjective.
DYNAMIC PROGRAMMING PARSING
Dynamic programming parsing is a foundational approach in NLP used for the syntactic analysis of
sentences. Intermediate parsing results are stored in a table (a chart) so that each sub-result is computed
only once and reused.
CONTEXT-FREE GRAMMARS (CFGs)
 Dynamic programming parsing is often applied to CFGs, which define the syntax of a language. A CFG consists of:

NON-TERMINALS
 A set of non-terminal symbols.
Eg: S, NP, VP
TERMINALS:
 A set of terminal symbols (the words of the language).

PRODUCTION RULES:
 A set of production rules.

START SYMBOL:
 A start symbol.
Example: S
EARLEY PARSING
 Earley parsing is a top-down dynamic programming algorithm for CFGs.
 It is a general-purpose parser.
 Unlike CYK, it does not require the grammar to be in Chomsky Normal Form (CNF).

KEY FEATURES:
 Uses a chart to store partial results.
 Has three main operations:
 prediction
 scanning
 completion
 Works in O(n³) time in the worst case, but can be faster for specific grammars.

CHART PARSING
Chart parsing is a generalized dynamic programming framework for parsing.
STEP:
Intermediate results are stored in a chart.
EXAMPLE:
"John saw Mary".
Grammar (CNF format):
S → NP VP
VP → V NP
NP → "John" | "Mary"
V → "saw"
SHALLOW PARSING

Shallow parsing in NLP (Natural Language Processing) is simply the process of breaking a sentence
into smaller, easy-to-understand chunks or parts. These parts are typically phrases like:
 Noun phrases (NP) — parts that tell us who or what (e.g., "The dog," "My friend").
 Verb phrases (VP) — parts that tell us what action is happening (e.g., "is running," "eats lunch").
 Instead of looking at every single word’s relationship in the sentence (which is deep parsing), shallow
parsing focuses on grouping words into bigger chunks like noun phrases and verb phrases, just to get
a general idea of the sentence structure.
For example:
Sentence: "The cat sleeps on the mat."
Shallow parsing might break it into:
 Noun phrase: "The cat"
 Verb phrase: "sleeps"
 Prepositional phrase: "on the mat"
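A minimal shallow-parsing (chunking) sketch with NLTK's RegexpParser; the chunk grammar is an assumed toy pattern (an NP is an optional determiner, any adjectives, then a noun):
import nltk

sentence = [("The", "DT"), ("cat", "NN"), ("sleeps", "VBZ"),
            ("on", "IN"), ("the", "DT"), ("mat", "NN")]
chunker = nltk.RegexpParser("NP: {<DT>?<JJ>*<NN>}")
tree = chunker.parse(sentence)
print(tree)   # groups "The cat" and "the mat" into NP chunks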

PROBABILISTIC CFG
A Probabilistic CFG (Context-Free Grammar) in NLP is a type of grammar that helps computers
understand and break down sentences, but with a twist: it uses probabilities to make decisions about
which sentence structure is most likely correct.
Here’s how it works in simple terms:
 Context-Free Grammar (CFG) is like a set of rules that tells you how words can be put together to
form phrases and sentences. For example, a rule might say that a sentence (S) can be made of a noun
phrase (NP) and a verb phrase (VP).
 Probabilistic means that instead of just following one rule, the computer looks at different possible
ways a sentence could be structured and picks the most likely one based on past examples. It assigns a
probability to each possible structure.
 So, a Probabilistic CFG takes into account not just the rules for sentence structure but also the
likelihood of one structure being correct over another. It helps computers handle real-world language
better, where many sentences can have more than one possible meaning or structure.

Example:
For the sentence "The cat sleeps".
 Rule 1: Sentence (S) → Noun Phrase (NP) + Verb Phrase (VP)
 Rule 2: Noun Phrase (NP) → "The cat"
 Rule 3: Verb Phrase (VP) → "sleeps"
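A minimal sketch of a probabilistic CFG in NLTK; the rule probabilities below are assumed toy values rather than counts estimated from a treebank:
import nltk

pcfg = nltk.PCFG.fromstring("""
    S -> NP VP [1.0]
    NP -> Det N [0.7] | N [0.3]
    VP -> V [0.4] | V NP [0.6]
    Det -> 'the' [1.0]
    N -> 'cat' [0.6] | 'dog' [0.4]
    V -> 'sleeps' [1.0]
""")

parser = nltk.ViterbiParser(pcfg)
for tree in parser.parse("the cat sleeps".split()):
    print(tree)          # the most probable parse
    print(tree.prob())   # its probability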

PROBABILISTIC LEXICALISED (CFG)


A Probabilistic Lexicalized Context-Free Grammar (PLCFG) is an extension of a standard
Context-Free Grammar (CFG) that incorporates probabilistic and lexicalized information to better
model natural language syntax. It is used in Natural Language Processing (NLP) to represent the
syntactic structure of sentences while addressing the ambiguity and lexical dependency issues
inherent in natural languages.
Here’s a breakdown of the components:

1. Context-Free Grammar (CFG):


CFG is a type of grammar that consists of:

 Terminals: Words or symbols that appear in the final output (e.g., tokens in a sentence).
 Non-terminals: Abstract categories (e.g., NP for noun phrases, VP for verb phrases).
 Rules: Productions of the form A →α, where A is a non-terminal, and α is a sequence of terminals
and/or non-terminals.
 Start Symbol: The initial non-terminal that represents the whole sentence (e.g., S for the sentence).

2. Lexicalized CFG:
In a Lexicalized CFG, the grammar rules are augmented with headwords, which are the most
semantically important words in a phrase. For example:

 Instead of a rule like NP → Det Noun, a lexicalized rule might look like NP(dog) → Det(the)
Noun(dog), where dog is the headword of the noun phrase.

3. Probabilistic Extension:
A Probabilistic CFG (PCFG) assigns probabilities to each production rule. These probabilities
reflect how likely a given rule is used in practice. For example:

 If VP → V NP has a probability of 0.6 and VP → V PP has a probability of 0.4, it reflects that the
former rule is more commonly observed.
In a Probabilistic Lexicalized CFG (PLCFG):

 Each rule is conditioned on lexical (headword) information, allowing the model to incorporate both
structural and lexical dependencies.

For example:

Mathematically:

P(NP → Det Noun | dog) = 0.8
P(NP → Pronoun | dog) = 0.2

Why Use PLCFGs in NLP?

1. Lexical Sensitivity:

By including lexical information, PLCFGs can better model dependencies between words, such as
subject-verb agreement and prepositional phrase attachment.

2. Disambiguation:

Probabilities help resolve ambiguities in parsing. For example, in sentences with multiple possible
parses, the model selects the most probable one.

3. Expressiveness:

PLCFGs handle both syntactic and lexical phenomena in a unified framework.

Applications in NLP:

1. Parsing:

PLCFGs are commonly used in statistical parsers to generate the most likely syntactic tree for a
given sentence.

2. Machine Translation:

Helps in aligning syntactic structures between source and target languages.

3. Language Modeling:

Used for syntactic language models in speech recognition or text generation.


Challenges:

1. Data Sparsity:

Lexicalized models require large corpora to estimate probabilities accurately.

2. Computational Complexity:

Incorporating lexical dependencies increases the size of the grammar and parsing complexity.

3. Headword Selection:

Identifying headwords consistently can be non-trivial.

FEATURE STRUCTURE

A feature structure is a formal way to represent linguistic information using attribute-value pairs.

Example:

[
Category:Noun,
Number:Singular,
Gender:Masculine
]

 They are commonly used in unification-based grammar frameworks like HPSG (Head-driven Phrase
Structure Grammar) and LFG (Lexical-Functional Grammar).

UNIFICATION OF FEATURE STRUCTURE

 Unification is the process of merging two feature structures by combining their information.
 If two feature structures contain contradictory information, the unification fails.

Example:

Structure 1: [Number:Singular, Gender:Masculine]


Structure 2: [Number:Singular, Case:Nominative]
Unified: [Number:Singular, Gender:Masculine, Case:Nominative]

If there’s a conflict:

Structure 1: [Number:Singular]
Structure 2: [Number:Plural]
Result: Unification fails.

 Unification is useful for ensuring agreement in linguistic structures (e.g., subject-verb agreement).
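The same behaviour can be sketched with NLTK's FeatStruct class (a minimal example, assuming NLTK is installed):
from nltk.featstruct import FeatStruct

fs1 = FeatStruct(Number='Singular', Gender='Masculine')
fs2 = FeatStruct(Number='Singular', Case='Nominative')
print(fs1.unify(fs2))   # merged structure with Number, Gender and Case

fs3 = FeatStruct(Number='Plural')
print(fs1.unify(fs3))   # None: the Number values conflict, so unification fails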
---End of Unit 3---

UNIT-4

SEMANTICS AND PRAGMATICS: Requirements for representation, First-Order Logic,


Description Logics – Syntax-Driven Semantic analysis, Semantic attachments – Word Senses,
Relations between Senses, Thematic Roles, selection restrictions – Word Sense Disambiguation,
WSD using Supervised, Dictionary & Thesaurus, Bootstrapping methods – Word Similarity
using Thesaurus and Distributional methods.

Semantics:
In Natural Language Processing (NLP), "semantics" refers to the study of meaning within language,
focusing on how words, phrases, and sentences convey information and their relationships within a
context.

Example:
Sentence: "The bat flew out of the cave.”
"Bat" here refers to a flying mammal, not a baseball bat, because of the context of "cave".
Pragmatics:
 Pragmatics considers how meaning is influenced by the context, including speaker intentions,
assumptions, and the broader discourse.
Example:

Sentence: "Can you pass the salt?"

 Pragmatics: In the context of a dinner table, the pragmatic meaning is that the speaker is requesting
the listener to pass the salt, not asking whether the listener is capable of doing so.
 Implicature: The sentence implies a request, not a literal question about ability.

Requirements for representation:

To represent meaning in NLP, certain requirements must be met:

 Truth-conditional semantics: A representation must capture the conditions under which a sentence is
true or false.
 Context sensitivity: Representations must account for contextual information, such as who is
speaking, when, and where (pragmatics).
 Disambiguation: Representations must handle ambiguity, whether lexical, syntactic, or semantic.
 Scalability: Representations should be efficient to process, especially when handling large corpora or
real-world applications.

Example representation techniques:


 Bag-of-Words (BoW):
 Concept: Treats a document as a collection of words, ignoring word order.
 Example: "The cat sat on the mat" would be represented as a vector with counts for "the", "cat",
"sat", "on", "mat".
 Limitations: Does not capture semantic relationships between words or context.
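A minimal Bag-of-Words sketch using scikit-learn's CountVectorizer (scikit-learn is assumed to be installed; get_feature_names_out requires a recent version):
from sklearn.feature_extraction.text import CountVectorizer

docs = ["The cat sat on the mat", "The dog sat on the log"]
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)
print(vectorizer.get_feature_names_out())   # vocabulary learned from the corpus
print(X.toarray())                          # word-count vector for each document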
First-Order Logic:

 In Natural Language Processing (NLP), First-Order Logic (FOL) is a way to represent complex
sentences and relationships between entities using predicates, variables, and quantifiers, allowing for
more nuanced analysis than simple propositional logic.
Example:

Sentence: "Some students like to read."


 FOL representation: "∃x (Student(x) ∧Likes(x, Reading))
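The same formula can be built and inspected with NLTK's logic parser (a minimal sketch; the predicate names Student and Likes come from the example above):
from nltk.sem import Expression

read_expr = Expression.fromstring
fol = read_expr('exists x.(Student(x) & Likes(x, Reading))')
print(fol)          # exists x.(Student(x) & Likes(x,Reading))
print(fol.free())   # free variables: empty set, because x is bound by the quantifier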
Description logics:

 Description Logics (DL) is a family of formal logics used to represent knowledge about concepts and
their relationships.
 Syntax: In DL, concepts (classes) and roles (relationships) are combined to form axioms. It uses a set
of constructors to define concepts and relationships.
o Examples:
o Concepts: Person, Doctor, Woman
o Roles: hasChild, worksAs
o Axioms: Doctor ⊆ Person, meaning "All doctors are persons."
 Semantics: Description logic provides a formal way to model the world, using a set of rules and
constraints. This allows for reasoning about relationships between entities and their properties.
 Use in NLP: DL is commonly used in ontologies, taxonomies, and semantic web technologies. It helps
in reasoning over structured knowledge and is used in systems like knowledge graphs and Linked
Data.

Syntax:

 In Natural Language Processing (NLP), "syntax" refers to the analysis of how words are arranged in a sentence to understand its grammatical structure, essentially focusing on the rules governing the order and relationships between words in order to interpret the sentence.

Example:

 Sentence: "The dog chased the cat."


 Tokenization: ["The", "dog", "chased", "the", "cat"]
 Part-of-speech tagging: ["The" (article), "dog" (noun), "chased" (verb), "the" (article), "cat" (noun)]
 Dependency parsing:
o "chased" (verb) is the root of the sentence.
o "dog" (noun) is the subject of "chased".
o "the" (article) modifies "dog".
o "cat" (noun) is the object of "chased".


1. Introduction to Syntax-Driven Semantic Analysis
 Syntax and Semantics Relationship:

o Syntax refers to the structure of sentences, while semantics deals with the meaning.

o Syntax-driven semantic analysis involves using syntactic structures (like parse trees) to determine the
meaning of sentences.
 Goal:

o To bridge the gap between syntactic structure and the interpretation of sentences, where syntactic rules
guide how words combine to form meaningful phrases.

2. Syntax-driven Semantic Parsing


 Semantic Role Labeling (SRL): It identifies the predicate-argument structure of a sentence (who did
what to whom).
o Example: "John (Agent) kicked the ball (Theme)".

 Dependency Parsing: Syntactic structure that represents the dependency between words (subject-
verb-object relationships).
o Example: In the sentence "John kicked the ball", the word "kicked" is the head, "John" is the subject
(dependent), and "ball" is the object (dependent).
1. Introduction to Semantic Attachments
 Semantic Attachment refers to the process of associating or linking the syntactic structure of a
sentence with its corresponding meaning. It focuses on how meanings are connected to particular parts
of a sentence through syntactic attachments.
 Goal:

o The goal is to link syntactic structures (like parse trees) to semantic representations, allowing systems
to understand the meaning of phrases and sentences as they relate to their structure.

2. Semantic Attachments in Parsing


 Parse Trees:

o A parse tree represents the syntactic structure of a sentence, breaking it down into its constituents.

o In semantic attachment, the challenge is to link each part of the tree with its corresponding meaning.

a. Example of Semantic Attachment


Consider the sentence: "The cat chased the mouse."
 Syntactic Structure:

o S -> NP + VP

o NP -> Det + N

o VP -> V + NP
 Semantic Structure:

o "The cat" -> Agent (who is performing the action)

o "chased" -> Verb (action)

o "the mouse" -> Theme (what is affected by the action)

 Meaning Representation: In this case, the verb "chased" connects the subject "cat" and the object
"mouse." The attachment of semantic roles (Agent, Theme) to the syntactic constituents allows the
system to understand that "the cat" is the agent performing the action on "the mouse."

3. Types of Semantic Attachments


 Semantic Role Labeling (SRL):

o SRL is a crucial component of semantic attachment. It labels the different arguments in a sentence with
their roles, such as Agent, Theme, Goal, etc.
o Example: In the sentence "John (Agent) gave the book (Theme) to Mary (Goal)", SRL identifies and
labels the roles of John, book, and Mary.
 Attachment of Roles to Syntactic Constituents:

o In SRL, syntactic constituents (like noun phrases or verb phrases) are associated with semantic roles.

o For example, in a sentence like "She baked a cake for her friend", the role of Agent is "She," the
Theme is "cake," and the Goal is "her friend."
 Selectional Preferences:

o Words have selectional restrictions that determine what semantic arguments can be attached to them.
For example, the verb "eat" requires a Theme that is typically an edible object, and it requires an
Agent that is a living being.
 Word Sense Disambiguation (WSD):

o Another aspect of semantic attachment is choosing the correct sense of a word based on context.

o Example: "bank" can mean a financial institution or the side of a river. The correct semantic role can
only be attached once the word sense is determined.

4. Techniques for Semantic Attachments


 Combinatory Categorial Grammar (CCG):

o A grammar framework that combines syntax and semantics directly. It is particularly useful for
semantic attachment as it allows lexically grounded meanings to be composed from syntactic
structures.
 Dependency Parsing:

o In dependency parsing, words in a sentence are connected through dependency relations. These
relations can also include semantic roles. For example, in a sentence like "She (Agent) gave a gift
(Theme) to John (Goal).", the verb "gave" has a dependency relation to each of its arguments with the
roles: Agent, Theme, and Goal.
 Lambda Calculus:

o Lambda calculus is often used to express the meaning of syntactic structures. In this case, each
syntactic component is associated with a lambda expression that captures its semantic meaning.
o Example: For the sentence "John gave Mary a book," a lambda expression might look like:

 λx.λy.gave(x, y) for the verb phrase, and similarly for "John" and "Mary."

 Role-Specific Attachments:

o Semantic roles like Agent, Theme, Goal, etc., are explicitly attached to words based on their syntactic
relationships. This allows the sentence structure to "distribute" meaning over its parts.

5. Importance of Semantic Attachments


 Understanding Sentence Meaning: By attaching semantic roles to different parts of a sentence, NLP
systems can determine the overall meaning of a sentence. This is crucial for tasks like machine
translation, question answering, and information extraction.
 Contextual Understanding: By attaching meanings to syntactic components, systems can better
understand the context of a sentence. For example, the same word might have different meanings
depending on the context, and semantic attachments help clarify this.
 Disambiguation: Semantic attachment helps disambiguate sentences with multiple meanings by
assigning specific roles to components in the sentence, such as distinguishing between different senses
of a verb.

6. Challenges in Semantic Attachment


 Ambiguity:

o Syntactic Ambiguity: Sentences can have multiple syntactic structures, leading to different ways of
attaching semantics. For example, in "I saw the man with the telescope", ambiguity arises from
whether "with the telescope" modifies "man" or "saw."
o Semantic Ambiguity: Words and phrases can have multiple meanings depending on the context.
Disambiguating and attaching the correct meaning is complex.
 Complexity of Argument Structure: Some verbs can take multiple arguments with different semantic
roles, such as transitive verbs (e.g., "gave") or ditransitive verbs (e.g., "send").
 World Knowledge: In some cases, attaching the right meaning requires external knowledge about the
world. For example, knowing that a "bank" could mean either a financial institution or a riverbank
requires the system to have contextual or world knowledge.

7. Applications of Semantic Attachments


 Machine Translation (MT):
o MT systems rely on semantic attachments to ensure that the meaning of a sentence is preserved when
translating from one language to another. Without proper attachment of semantic roles, translations
may lose important nuances.
 Information Extraction:

o For extracting meaningful information from texts, systems rely on correctly attaching semantic roles to
identify entities, events, and their relationships in unstructured data (like news articles).
 Question Answering (QA):

o In QA systems, semantic attachments allow the system to identify the correct answer by understanding
the semantic roles of different parts of the question and the text.
 Dialogue Systems:

o In conversational AI or chatbots, semantic attachment helps systems understand the user’s input in
terms of intent, entities, and relationships, leading to more accurate responses.
1. Introduction to Sense Relations
 Word Sense Disambiguation (WSD):

o In NLP, words often have multiple meanings or senses depending on the context in which they appear.
The process of determining which sense of a word is being used is called Word Sense
Disambiguation (WSD).
 Sense Relations:

o Words with multiple senses do not exist in isolation; they often relate to each other. These relations
between senses help NLP systems understand meaning, resolve ambiguity, and capture semantic
similarity.
o Understanding these relationships is key in improving tasks like machine translation, information
retrieval, and sentiment analysis.

2. Types of Sense Relations


Sense relations refer to how the meanings (senses) of words are connected or related. These relationships
are often classified into several categories.
3. Lexical Semantic Networks for Sense Relations
 WordNet:

o One of the most well-known lexical semantic networks, where words are organized into synsets (sets
of synonyms) and linked through various relationships (e.g., synonymy, antonymy, hyponymy).
o WordNet is widely used in WSD, information retrieval, and semantic parsing tasks (see the WordNet sketch below).

 FrameNet:

o A semantic network based on frames, where words are linked to predefined semantic structures (called
frames). It is particularly useful for understanding semantic roles in sentences.
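A minimal sketch of querying WordNet with NLTK (the data must be fetched once with nltk.download('wordnet')); it lists a few senses of an ambiguous word and computes a simple path-based similarity between two synsets:
from nltk.corpus import wordnet as wn

# A few senses (synsets) of the ambiguous word "bank"
for syn in wn.synsets('bank')[:3]:
    print(syn.name(), '-', syn.definition())

# Thesaurus-based word similarity: path similarity in the hypernym hierarchy
dog = wn.synset('dog.n.01')
cat = wn.synset('cat.n.01')
print(dog.path_similarity(cat))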

4. Applications of Sense Relations in NLP


a. Word Sense Disambiguation (WSD)


 Definition: The process of determining the correct sense of a word based on its context.

 Use of Sense Relations: Sense relations like hyponymy, synonymy, and entailment are useful for
disambiguating word senses. For example, knowing that "bank" could be a financial institution or the
side of a river helps a system decide which sense is more likely based on surrounding words.
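The dictionary-based disambiguation described above can be sketched with NLTK's implementation of the Lesk algorithm (WordNet and the punkt tokenizer must be downloaded first); the returned synset depends on the overlap between the context and each sense's gloss:
from nltk import word_tokenize
from nltk.wsd import lesk

context = word_tokenize("I went to the bank to deposit my money")
sense = lesk(context, "bank", "n")   # restrict to noun senses of "bank"
print(sense)                          # the synset whose gloss overlaps most with the context
print(sense.definition())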
b. Information Retrieval
 Use of Sense Relations: By recognizing synonymy and hypernymy relations, information retrieval
systems can expand queries to include related terms, improving search accuracy. For example,
searching for "automobile" could also retrieve documents containing "car," thanks to their hyponymic
relationship.
c. Textual Entailment and Question Answering
 Use of Sense Relations: Entailment helps systems understand whether one sentence implies another,
which is critical for tasks like question answering and textual entailment.
d. Semantic Role Labeling
 Use of Sense Relations: The identification of roles such as Agent, Theme, and Goal in sentence
parsing can be aided by knowing the semantic relationships between words.
e. Sentiment Analysis
 Use of Sense Relations: Antonymy helps identify sentiments (positive or negative) in text. For
instance, detecting the opposite meanings of words like "good" and "bad" helps systems understand the
overall sentiment of a sentence.

5. Challenges in Sense Relations


 Polysemy: Many words have more than one meaning, and determining the correct sense can be
difficult without sufficient context.
o Example: "bat" (animal) vs. "bat" (sports equipment). The system must distinguish between these
meanings based on context.
 Contextual Variability: Word meanings can change depending on the sentence, the speaker’s
intention, or the discourse. Identifying the relevant sense requires sophisticated understanding of the
context.
 Lack of Complete Lexical Resources: Although resources like WordNet are valuable, they are not
exhaustive and may lack certain domain-specific senses or relationships.
 Ambiguity in Relations: Some relationships, like synonymy and hyponymy, may not always be clear-
cut. For example, determining whether two words are true synonyms or just related in meaning can be
tricky.

Thematic Roles:
 Definition: Thematic roles (also called theta roles) represent the semantic relationship between a verb
and its arguments (noun phrases) in a sentence. They describe the role that each argument plays in
relation to the action described by the verb.
Key Thematic Roles:

1. Agent:
o The entity that performs or initiates the action or event.

o Example: John broke the vase.

o Explanation: John is the agent because he is the one who performs the action (breaking).

2. Experiencer:
o The entity that perceives or experiences an event or state. The experiencer undergoes an emotional or
sensory perception but doesn't cause the event.
o Example: She heard a noise.

o Explanation: She is the experiencer because she perceives the noise.

3. Theme:
o The entity that is involved in or affected by the action, but it doesn’t actively participate in causing the
event.
o Example: John ate the pizza.

o Explanation: The pizza is the theme because it is affected by the action (being eaten).

4. Goal:
o The destination or endpoint of a movement or action.

o Example: She went to the store.

o Explanation: The store is the goal because it is the destination of her movement.

5. Source:
o The origin or starting point of an action or movement.

o Example: He walked from the park.

o Explanation: The park is the source because it is the starting point of the movement.

6. Instrument:
o The means or tool used to perform the action.

o Example: He cut the paper with a knife.

o Explanation: The knife is the instrument because it is the tool used to perform the action of cutting.

7. Recipient:
o The entity that receives something, typically in actions like giving, sending, or transferring.

o Example: She gave the book to John.

o Explanation: John is the recipient because he receives the book.

8. Patient:
o Similar to the theme, but the patient is the entity that is affected by the action and may undergo a
change of state (often passive).
o Example: The glass broke.

o Explanation: The glass is the patient because it is affected by the breaking event.
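The roles above can be made concrete by representing a parsed sentence as a simple role-to-filler mapping. The sketch below is a minimal, hypothetical illustration in Python (the hard-coded role assignments and the label_roles helper are assumptions for this example, not part of any standard library):

# Minimal sketch: representing thematic roles as a role -> filler mapping.
# The sentence and role assignments follow the examples in the notes.

def label_roles():
    # "He cut the paper with a knife."
    roles = {
        "Agent": "He",            # the one performing the cutting
        "Patient": "the paper",   # the entity affected by the action
        "Instrument": "a knife",  # the tool used to perform the action
    }
    return roles

if __name__ == "__main__":
    for role, filler in label_roles().items():
        print(f"{role:<12} -> {filler}")

A real semantic role labeler would produce such mappings automatically from a parsed sentence; here the mapping is written out by hand purely to show the target representation.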

Selection Restrictions:
 Definition: Selection restrictions (also known as subcategorization restrictions or argument structure
constraints) are limitations imposed by a verb on the types of arguments it can take. These restrictions
are semantic in nature and ensure that only certain types of noun phrases (NPs) or arguments can
combine with a verb based on the verb's meaning.
Key Points:
 Selection restrictions are important for syntactic and semantic well-formedness in a sentence.

 They are tied to the verb’s meaning and the roles that its arguments must fulfill. Each verb has its
own set of restrictions about what kinds of arguments (in terms of their meaning) it can take.
Types of Selection Restrictions:
1. Argument Type Restrictions:
o Verbs require specific types of noun phrases as their arguments. For example:

 Transitive verbs like eat, buy, and kick require an object (a direct object), which is often a concrete
noun (e.g., apple, car).
 Intransitive verbs like sleep or arrive require a subject but no object.

 Ditransitive verbs like give, send, and show take two objects: a recipient and a theme (e.g., She gave
John the book).
2. Semantic Restrictions:
o Verbs are restricted by the semantic properties of the arguments they take. For example:

 The verb eat requires an argument that is edible (e.g., pizza, apple) and cannot take an argument that is
inanimate or non-edible (e.g., car).
 The verb go requires a place or destination as its argument (e.g., to the park), and it cannot take a
human entity in certain contexts (e.g., to John).
3. Number Restrictions:
o Some verbs impose restrictions on whether their arguments can be singular or plural. For example:

 The verb live can take either a singular or a plural subject, while the verb die typically takes a singular entity or a collective group as its subject (e.g., The team died).
4. Case Restrictions (in languages with case markings):
o Some verbs impose constraints on the grammatical case of their arguments. For instance, in
languages like German or Russian, the case (nominative, accusative, dative) determines the type of
argument a verb can take.
 Example: In German, the verb helfen (to help) takes a dative object (e.g., Ich helfe dem Mann — I help
the man).
5. Thematic Role Restrictions:
o Certain verbs are constrained by the thematic roles they can assign to their arguments.

 Example: The verb buy typically requires an Agent, a Theme, and a Recipient.

 Example: The verb run typically requires an Agent (the one running), and a Goal (the place where one
is running).
Examples of Selection Restrictions:
1. The man ate an apple:
o The verb ate requires a Theme that is edible. The noun apple satisfies the restriction as it is a type of
food.
2. She kicked the table:
o The verb kick requires an inanimate object that can be physically kicked, so it cannot take something
abstract like "love" or "idea."
3. He gave the book to Sarah:
o The verb give requires two objects: a Recipient (Sarah) and a Theme (the book). You cannot say "He
gave apple to Sarah" unless "apple" is valid based on the context (e.g., if apple was the object being
transferred).
4. She sleeps:
o The verb sleep only takes a subject (Agent/Experiencer) and cannot take an object, because it is
intransitive.
5. The car ran:
o The verb run generally requires an Agent that is capable of moving. Thus, a sentence like "The car
ran" would only work in contexts where run refers to the functioning of the car, and not physical
movement in the usual sense (which would be reserved for an Agent).
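One way to approximate a semantic selection restriction automatically is to test whether a candidate argument belongs to a required semantic class in WordNet (e.g., the object of eat should be a kind of food). The sketch below uses NLTK's WordNet interface; treating food.n.02 as the restriction class for eat is an assumption made only for illustration:

# Rough sketch: approximate the selection restriction of "eat" (object must be edible)
# by testing whether any sense of the noun is a hyponym of WordNet's food.n.02 synset.
from nltk.corpus import wordnet as wn

def satisfies_restriction(noun, restriction_synset="food.n.02"):
    target = wn.synset(restriction_synset)
    for sense in wn.synsets(noun, pos=wn.NOUN):
        # Collect all hypernyms (transitively) of this sense of the noun.
        ancestors = set(sense.closure(lambda s: s.hypernyms()))
        if target == sense or target in ancestors:
            return True
    return False

print(satisfies_restriction("apple"))  # expected True: "eat an apple" is well-formed
print(satisfies_restriction("car"))    # expected False: "eat a car" violates the restriction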

Word Sense Disambiguation (WSD):


Definition:
 Word Sense Disambiguation (WSD) is the process of determining which meaning of a word is being
used in a given context when the word has multiple meanings (i.e., polysemy). Many words have
multiple meanings, and WSD helps in choosing the correct meaning based on the surrounding text or
context.
Why WSD is Important:
 Words like bank, bat, bark, etc., can have multiple meanings depending on the context. Properly
identifying the correct sense of a word is essential for accurate machine translation, information
retrieval, text classification, and other Natural Language Processing (NLP) tasks.
Types of Word Sense Disambiguation:


1. Supervised WSD:
o Definition: In supervised WSD, a model is trained on a labeled dataset where words are already
tagged with the correct sense. The system learns to associate the context of the word with the correct
sense from these labeled examples.
o Steps:

1. Collect a labeled corpus (e.g., WordNet annotations, SemCor).


2. Extract features from the surrounding context of the word.
3. Train a machine learning model (e.g., decision trees, SVM, or neural networks) to predict the sense of
the word.
4. Test the model on unseen data to evaluate performance.
o Advantages: High accuracy when large, annotated datasets are available.

o Disadvantages: Requires a large labeled dataset, which can be expensive and time-consuming to
create.
2. Unsupervised WSD:
o Definition: In unsupervised WSD, the system tries to disambiguate the word senses without any
labeled training data. Instead, it uses statistical or heuristic methods to group words into clusters
based on context.
o Methods:

 Clustering-based methods: Group similar word senses based on their context in the corpus.

 Contextual similarity: Compare the context of the ambiguous word with known senses using
measures like cosine similarity or Latent Semantic Analysis (LSA).
o Advantages: No need for labeled data.

o Disadvantages: May be less accurate compared to supervised methods, as it relies purely on statistical
measures and does not have guidance from labeled data.
3. Knowledge-based WSD:
o Definition: Knowledge-based WSD methods use external resources like dictionaries, thesauri, or
semantic networks (e.g., WordNet) to determine the correct word sense. These methods compare the
context of the ambiguous word with definitions or relationships between words.
o Techniques:

 Lesk Algorithm: This algorithm compares the definition of each sense of the word with the definitions
of surrounding words in the context to determine the most appropriate sense.
 Graph-based methods: Use semantic networks (e.g., WordNet) to model the relationships between
words and find the most likely sense.
o Advantages: Uses linguistic resources like WordNet, which are often highly structured and
comprehensive.
o Disadvantages: Limited by the quality and coverage of the external resources; can be computationally
expensive.
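NLTK ships a simplified implementation of the Lesk algorithm, so knowledge-based disambiguation can be tried in a few lines. This is a small sketch; the sense returned depends on the WordNet glosses, and the simplified algorithm may not always pick the intuitive sense:

# Knowledge-based WSD with the (simplified) Lesk algorithm from NLTK.
from nltk.wsd import lesk
from nltk.tokenize import word_tokenize

sentence = "I went to the bank to deposit my money"
context = word_tokenize(sentence)

sense = lesk(context, "bank", pos="n")   # restrict to noun senses of "bank"
print(sense)               # the synset the simplified Lesk algorithm selects
print(sense.definition())  # its WordNet gloss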
Challenges in WSD:
1. Polysemy: A word may have many senses (e.g., "bat" can refer to an animal or a sports implement).
Determining the correct sense requires rich context.
2. Context Sensitivity: The same word may change its meaning based on various linguistic factors like
collocations (words frequently appearing together), syntax, and semantics.
3. Word Sense Granularity: Some words have very fine-grained senses (e.g., synonyms with subtle
differences), which may be difficult to disambiguate.
4. Ambiguity in the Surrounding Context: Sometimes, even with context, it may be hard to pick a
unique sense if the surrounding text is vague or lacking in sufficient information.
Applications of WSD:
 Machine Translation: Accurate word sense identification helps in translating words correctly into
another language.
 Information Retrieval: WSD is used to improve search engine results by identifying the correct
meaning of query terms.
 Text Summarization: Correct word sense helps in producing accurate summaries, as words need to be
interpreted correctly.
 Question Answering Systems: Ensuring the system understands the right meaning of words leads to
better answers.
How Supervised WSN Works in NLP:
1. Data Preparation:
o You need a labeled dataset with words annotated with their canonical senses.

o Example resources: WordNet, SemCor, or manually annotated datasets.

o Each occurrence of the polysemous word in the corpus must be associated with its most frequent or
standard meaning, which is the target for prediction.
2. Feature Extraction:
o In supervised learning, the model learns from the context in which the word appears. The surrounding
words and syntactic structure are important features for prediction.
o Common features include:

 Contextual N-grams: Sequences of words or tokens around the target word.

 Part-of-speech (POS) tags: The grammatical roles of surrounding words.

 Dependency Parsing: Relationships between the words in the sentence (e.g., subject-verb-object).
 Word Embeddings: Pre-trained vector representations (e.g., Word2Vec, GloVe) that capture the
semantic meaning of words.
 WordNet Senses: For models that use resources like WordNet, the features might include the possible
senses or synsets of the word.
3. Model Training:
o Once features are extracted, the system is trained using machine learning algorithms to predict the
correct canonical sense of the word based on its context.
o Supervised algorithms often used for WSN include:

 Support Vector Machines (SVM): Trains a classifier to distinguish between different senses based on
features.
 Decision Trees: Builds a decision structure that classifies words based on context.

 Logistic Regression: Used for simpler models that classify words based on a linear combination of
features.
 Neural Networks: Deep learning models like feedforward neural networks or recurrent neural
networks (RNNs) for context-based prediction.
4. Prediction:
o After training, the model is used to predict the canonical sense of the target word for unseen contexts.

o The model processes the context (surrounding words and syntactic features) and chooses the most
probable canonical sense from the set of possible senses.
5. Evaluation:
o The model’s performance is evaluated on a test set of annotated data.

o Common evaluation metrics for WSN include:

 Accuracy: The percentage of correct sense predictions.

 Precision and Recall: The model’s ability to predict relevant senses.

 F1-Score: The balance between precision and recall.
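As a concrete illustration of the training and prediction steps above, the sketch below trains a small context-window classifier with scikit-learn. The tiny labelled examples and the two sense labels are made up purely for illustration; a real system would train on an annotated corpus such as SemCor:

# Minimal supervised sense classifier: bag-of-words context features + logistic regression.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy training data: contexts of the ambiguous word "bank" with hand-made sense labels.
contexts = [
    "deposit money in the bank account",
    "the bank approved my loan application",
    "we sat on the river bank fishing",
    "the grassy bank of the stream",
]
senses = ["financial", "financial", "river", "river"]

model = make_pipeline(CountVectorizer(), LogisticRegression())
model.fit(contexts, senses)                                  # training step

print(model.predict(["she withdrew cash from the bank"]))    # expected: ['financial']
print(model.predict(["they walked along the muddy bank"]))   # expected: ['river']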

Supervised WSN Methods:


1. Classification-based Approach:
o Treats WSN as a classification problem where each possible sense of a word is a distinct class.

o For example, in the case of the word "bank", the model learns to classify the context into categories
like financial institution or riverbank based on features.
2. Sequence Labeling Approach:
o In certain tasks like Named Entity Recognition (NER) or other contextual NLP tasks, WSN can be
treated as a sequence labeling problem, where each token in a sequence is assigned a sense label.
o Example: In a sentence like "I went to the bank to deposit money", the word "bank" is labeled with the
financial institution sense.

Advantages of Supervised WSN:


1. High Accuracy: Supervised models can perform well with large, high-quality labeled datasets, as they
learn directly from annotated data.
2. Customizable: Supervised models can be tailored to specific domains or applications, improving the
accuracy of WSN in specialized contexts (e.g., medical, legal).
3. Clear Objective: Since the model is trained with a defined target (the canonical sense), the task has a
clear and objective evaluation metric.

Challenges of Supervised WSN:


1. Labeled Data Dependency: Requires a large annotated dataset where word senses are tagged
correctly. This can be time-consuming and expensive to create.
2. Domain Adaptation: A model trained on general data may not perform well in specific domains (e.g.,
medical texts, legal documents) unless retrained on domain-specific labeled data.
3. Feature Engineering: Supervised models often rely on manually engineered features, which can be
limited and may not capture all the contextual nuances.
4. Overfitting: If the model is too complex, it may memorize the training data (overfitting) and fail to
generalize to new, unseen data.

Applications of Supervised WSN:


1. Text Normalization:
o Used in tasks like text preprocessing to standardize word meanings, especially when words with
multiple meanings appear in varied contexts.
2. Information Retrieval:
o Helps search engines return more relevant results by ensuring that search terms with multiple
meanings are interpreted correctly in context.
3. Machine Translation:
o Ensures that the correct sense of a word is used in translation, which is essential for accurate
translations in polysemous words.
4. Sentiment Analysis:
o Improves sentiment classification by making sure the right meaning of ambiguous words is used in
determining the sentiment of a sentence.

Dictionaries and Thesaurus in NLP:


1. Introduction to Dictionaries and Thesaurus in NLP:
 Dictionaries and thesauruses are essential resources in Natural Language Processing (NLP) as they
provide structured information about words, their meanings, and relationships to other words.
 They are often used to help NLP systems understand word meanings, resolve ambiguities, and enhance
tasks such as machine translation, word sense disambiguation (WSD), and information retrieval.

2. Dictionary in NLP:
A dictionary is a resource that provides information about individual words, including their meanings,
usage, and other properties. In the context of NLP, dictionaries typically include:
 Word Definitions: The meaning of each word (e.g., cat is a small domesticated carnivorous mammal).

 Synonyms: Different words that have the same or similar meanings.

 Antonyms: Words with opposite meanings.

 Part of Speech (POS): The grammatical category to which a word belongs, e.g., noun, verb, adjective.

 Pronunciation: Phonetic transcription of how a word is spoken.

 Morphology: Information about word roots, prefixes, and suffixes (e.g., happy → happiness).

Examples of Dictionaries:
1. WordNet: A lexical database of English words organized into synsets (sets of synonyms). It includes
semantic relationships like hyponymy, hypernymy, meronymy, etc.
2. Oxford Dictionary: A general dictionary with word meanings, etymology, pronunciation, and usage
examples.
3. Merriam-Webster: Another widely used dictionary that includes definitions, synonyms, and usage
information.
Uses of Dictionaries in NLP:
 Word Sense Disambiguation (WSD): Helps identify the correct sense of a word based on context.

 Morphological Analysis: Used to find word roots, prefixes, and suffixes (e.g., running → run).

 POS Tagging: Helps in identifying the grammatical role of a word (e.g., verb, noun).

 Spell Checking and Correction: Identifies correct spelling and suggests alternatives.
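WordNet can be queried programmatically through NLTK, covering most of the dictionary-style information listed above (definitions, part of speech, synonyms, and simple morphology via lemma lookup). A short sketch, assuming the WordNet data has been downloaded with nltk.download('wordnet'):

# Dictionary-style lookups with WordNet via NLTK.
from nltk.corpus import wordnet as wn

for synset in wn.synsets("cat"):
    print(synset.name(), "|", synset.pos(), "|", synset.definition())

# Synonyms (lemma names) of the first noun sense of "cat"
print(wn.synsets("cat", pos=wn.NOUN)[0].lemma_names())

# Simple morphological lookup: reduce an inflected form to its base form
print(wn.morphy("running", wn.VERB))   # expected: 'run'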

3. Thesaurus in NLP:
A thesaurus is a reference resource that lists synonyms and antonyms for words. Unlike a dictionary,
which gives a word's definition and usage, a thesaurus focuses on providing alternative words with
similar meanings or opposite meanings.
Key Features of a Thesaurus:
 Synonyms: Words that have the same or nearly the same meaning (e.g., happy → joyful, content).

 Antonyms: Words with opposite meanings (e.g., happy → sad, joyful → mournful).

 Contextual Usage: Some thesauri may provide guidelines on when to use particular synonyms, based
on connotation or formality.
 Part of Speech: Synonyms and antonyms are typically categorized by the part of speech (e.g.,
adjectives, verbs).
Examples of Thesauri:
1. Roget’s Thesaurus: One of the most famous thesauruses, which organizes words by their meanings
and groups synonyms into categories.
2. WordNet: While primarily a dictionary, it also has a thesaurus-like structure with synonym sets for
each word sense.
Uses of Thesaurus in NLP:
 Text Generation: Helps in paraphrasing and generating alternative phrases.

 Query Expansion: In information retrieval, synonyms can be used to expand search queries to retrieve
more relevant documents.
 Sentiment Analysis: Helps in identifying the sentiment of a sentence by looking at the meaning of
words (e.g., using positive synonyms to assess the tone of the text).
 Machine Translation: Can improve translations by providing alternative words with similar
meanings.
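WordNet can also be used as a machine-readable thesaurus: synonyms come from the lemmas of each synset, and antonyms from lemma-level antonym links. A minimal sketch:

# Thesaurus-style lookup: synonyms and antonyms of "happy" from WordNet.
from nltk.corpus import wordnet as wn

synonyms, antonyms = set(), set()
for synset in wn.synsets("happy", pos=wn.ADJ):
    for lemma in synset.lemmas():
        synonyms.add(lemma.name())
        for ant in lemma.antonyms():
            antonyms.add(ant.name())

print("Synonyms:", sorted(synonyms))
print("Antonyms:", sorted(antonyms))   # typically includes 'unhappy'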

5. Key Differences Between Dictionary and Thesaurus in NLP:


Aspect-wise comparison of Dictionary and Thesaurus:
 Focus: A dictionary provides word definitions and grammatical information; a thesaurus provides synonyms and antonyms for words.
 Purpose: A dictionary helps in understanding word meanings; a thesaurus helps in finding alternative words (synonyms/antonyms).
 Content: A dictionary covers meaning, pronunciation, usage, and etymology; a thesaurus covers synonyms, antonyms, and the context of word usage.
 Example: Oxford Dictionary and Merriam-Webster (dictionaries); Roget's Thesaurus and WordNet (thesauri).

6. Importance of Dictionaries and Thesauruses in NLP:


1. Understanding Word Meaning:
o Both dictionaries and thesauruses help in clarifying word meanings, which is vital for tasks like text
understanding, question answering, and information retrieval.
2. Contextual Analysis:
o A dictionary helps in identifying the most appropriate meaning based on context, whereas a thesaurus
helps by providing semantic alternatives for expanding or rephrasing content.
3. Improving NLP Models:
o Dictionaries and thesauruses serve as external knowledge bases to enhance machine learning models
for tasks like language generation, semantic analysis, and language translation.
4. Facilitating Paraphrasing and Text Generation:
o In text generation and paraphrasing, thesauruses are used to replace words with their synonyms,
producing varied but semantically similar sentences.

Bootstrapping in NLP
1. What is Bootstrapping?
Bootstrapping is a machine learning technique used to improve models through an iterative process,
where the model is trained using its own predictions or the data it generates, often starting with a small
amount of labeled data. It can be used for tasks like text classification, information extraction, and
named entity recognition.
In the context of NLP, bootstrapping is used to automatically generate labeled data or refine models
over time using a small initial seed set.
2. Basic Idea of Bootstrapping:
 Start with a small set of labeled data (or seeds), which can be manually annotated or automatically
identified.
 Use the model trained on this small data set to predict additional data.

 Add the most confident predictions to the labeled dataset.

 Re-train the model on the expanded dataset and repeat the process.

 Over time, the model improves by gradually learning from its predictions, thus boosting its
performance.
3. Bootstrapping Process:
1. Initialization:
o Begin with a small initial seed set or a manually labeled corpus. This seed set is critical because it
gives the model initial guidance.
2. Model Training:
o Use this seed set to train an initial model. This could be a classifier, sequence labeler, or other NLP
models.
3. Prediction:
o Apply the trained model to unlabeled data to make predictions.

o For example, in named entity recognition (NER), the model might identify candidate entities in
unannotated text.
4. Selection of Confident Predictions:
o From the predictions, select the ones the model is most confident about. This could be done using
thresholds (e.g., only adding predictions with high confidence scores).
5. Expanding the Training Set:
o Add these confident predictions to the labeled dataset, essentially augmenting the training set.

6. Re-training:
o Re-train the model using the expanded training set. With more labeled data, the model is likely to
perform better.
7. Repeat:
o Repeat the process iteratively: predict new data, add confident predictions, and retrain. Over time, the
model becomes more robust.
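The iterative procedure above can be written as a small self-training loop. The code below is a schematic sketch with scikit-learn; the 0.9 confidence threshold, the number of rounds, and the toy texts are assumptions chosen only for illustration, not a prescribed recipe:

# Schematic self-training (bootstrapping) loop for text classification.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def bootstrap(seed_texts, seed_labels, unlabeled, rounds=3, threshold=0.9):
    texts, labels = list(seed_texts), list(seed_labels)
    pool = list(unlabeled)
    model = make_pipeline(TfidfVectorizer(), LogisticRegression())
    for _ in range(rounds):
        model.fit(texts, labels)                     # 1. train on the current labelled set
        if not pool:
            break
        probs = model.predict_proba(pool)            # 2. predict the unlabeled pool
        preds = model.classes_[probs.argmax(axis=1)]
        confident = probs.max(axis=1) >= threshold   # 3. keep only confident predictions
        texts += [t for t, keep in zip(pool, confident) if keep]
        labels += [p for p, keep in zip(preds, confident) if keep]
        pool = [t for t, keep in zip(pool, confident) if not keep]  # 4. shrink the pool
    return model

# Usage with toy data: two labelled seeds, the rest are bootstrapped.
seeds = ["great movie", "terrible film"]
seed_labels = ["pos", "neg"]
unlabeled = ["really great acting", "absolutely terrible plot", "great fun", "terrible ending"]
model = bootstrap(seeds, seed_labels, unlabeled)
print(model.predict(["great story"]))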

4. Types of Bootstrapping in NLP:


1. Self-training:
o In self-training, a model is trained on a small amount of labeled data and then used to label the
remaining unlabeled data.
o The key idea is that the model can "teach itself" by using its own predictions to expand its training
data.
2. Co-training:
o Co-training involves training two separate models using two different views of the data (e.g., using
different feature sets or representations).
o Each model labels data for the other model. This can be useful when one model’s predictions can
improve the other.
o Example: One model might focus on syntactic features, and the other on semantic features.

3. Multi-instance Bootstrapping:
o In this variant, a set of instances is labeled based on the presence or absence of a label in a group of
related instances.
o For example, in information extraction, bootstrapping can help in learning to extract a concept (e.g.,
named entities) from a set of documents or a collection of related texts.
4. Distant Supervision:
o A variation of bootstrapping where no explicit labels are provided for training, but the system uses
external knowledge sources (e.g., WordNet, Wikipedia) to automatically generate weak labels for
training.

5. Applications of Bootstrapping in NLP:


1. Named Entity Recognition (NER):
o In NER tasks, bootstrapping can be used to automatically learn entity types (e.g., person,
organization) by starting with a small set of labeled entities and then expanding the labeled data
through the model’s predictions.
o Example: If the system starts with a few manually labeled names of organizations (e.g., "Google",
"Microsoft"), it can apply the model to identify more organizations in the text.
2. Information Extraction:
o Bootstrapping can be used to extract relationships or structured information from unstructured data.

o For example, it could identify relationships between people, places, and events by starting with a small
number of known relationships and expanding the model over time.
3. Text Classification:
o In cases where labeled data is sparse, bootstrapping can be used to iteratively expand a dataset to
improve the performance of text classification tasks (e.g., spam detection, sentiment analysis).
4. Sentiment Analysis:
o Bootstrapping can help improve sentiment analysis models by using initial labels (e.g.,
positive/negative sentiment) and expanding the dataset with confident predictions on unlabeled text.
5. Word Sense Disambiguation (WSD):
o Bootstrapping can also aid in WSD tasks, where the goal is to determine the meaning of a word based
on context. Initial training data can help the model resolve ambiguities in words with multiple
meanings.

6. Advantages of Bootstrapping:
1. Reduced Need for Labeled Data:
o One of the biggest advantages of bootstrapping is that it requires very little labeled data to begin
with. This is especially useful in NLP, where obtaining labeled data can be expensive and time-
consuming.
2. Improved Model Performance:
o As the model is trained on more data over time, it can improve accuracy and generalization.

3. Scalability:
o Bootstrapping is scalable as it can work on large amounts of unlabeled data to gradually enhance
performance.
4. Cost-Efficiency:
o By starting with a small labeled dataset, bootstrapping reduces the costs associated with manual
labeling of data.

7. Challenges of Bootstrapping:
1. Error Propagation:
o If the model makes incorrect predictions in the early stages, it may add incorrect data to the training
set, leading to error propagation. This can degrade performance if the model continues to make
mistakes.
2. Initial Bias:
o The initial seed data can bias the system towards certain predictions, and if the seed set is
unrepresentative, this could affect the model’s ability to generalize.
3. Confidence Threshold Selection:


o Deciding how confident the model needs to be before adding a prediction to the training set is a
challenge. Setting this threshold too high can slow the learning process, while setting it too low can
introduce noisy data.
4. Quality of Predictions:
o The success of bootstrapping relies heavily on the quality of initial predictions. If the model is not
confident or accurate enough, it may not effectively expand its training set.

Word Similarity Using Thesaurus in NLP


1. Introduction to Word Similarity:
Word similarity refers to how similar two words are in terms of meaning or usage. It is an important
concept in Natural Language Processing (NLP) as it helps in various tasks such as:
 Information retrieval

 Text classification

 Word sense disambiguation (WSD)

 Machine translation

 Question answering

One way to determine word similarity is by leveraging a thesaurus. A thesaurus provides information
about synonyms (words with similar meanings) and antonyms (words with opposite meanings), which
can be used to measure how closely related two words are in terms of meaning.

2. Using a Thesaurus for Word Similarity:


A thesaurus is a lexical resource that lists words grouped by their meanings. It primarily provides:
 Synonyms: Words with similar meanings.

 Antonyms: Words with opposite meanings.

Key Steps to Calculate Word Similarity Using a Thesaurus:


1. Identify Synonyms:
o If two words share many synonyms, they are considered to be semantically similar. For example,
"happy" and "joyful" are synonyms, so they are likely to be similar.
2. Examine Synonym Sets:
o A thesaurus often groups synonyms into synsets (sets of synonyms). The more overlap two words
have in their synsets, the more similar they are.
o Example: "Car" and "automobile" would belong to the same synset in a thesaurus.

3. Measure Similarity by Synonym Overlap:


o One approach to measuring similarity is to look at the number of shared synonyms. The more
synonyms two words share, the higher their similarity score.
4. Use Thesaurus Relationships:


o Some thesauri, like WordNet, also provide additional relationships between words, such as:

 Hyponymy: One word is a more specific type of another (e.g., "dog" is a hyponym of "animal").

 Hypernymy: One word is a more general term for another (e.g., "animal" is a hypernym of "dog").

 Meronymy: One word refers to a part of something (e.g., "wheel" is a meronym of "car").

5. Contextual Similarity:
o Words with similar meanings but different contexts can be identified by examining the contexts in
which they appear. A thesaurus often includes context examples to help distinguish between different
usages of similar words.
o Example: "Cold" can mean "chilly" (weather) or "unfriendly" (person), and context is necessary to
disambiguate.
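A simple way to turn the synonym-overlap idea into a score is the Jaccard overlap of the two words' WordNet lemma sets; WordNet also offers graph-based scores such as path similarity. The sketch below shows both; the Jaccard formulation is one common choice, not the only one:

# Thesaurus-based similarity: Jaccard overlap of WordNet synonym sets,
# plus WordNet's built-in path similarity for comparison.
from nltk.corpus import wordnet as wn

def lemma_set(word):
    return {l.name().lower() for s in wn.synsets(word) for l in s.lemmas()}

def jaccard_similarity(w1, w2):
    a, b = lemma_set(w1), lemma_set(w2)
    return len(a & b) / len(a | b) if a | b else 0.0

print(jaccard_similarity("car", "automobile"))  # shared lemmas -> relatively high score
print(jaccard_similarity("car", "banana"))      # little or no overlap -> near 0

# Graph-based alternative: path similarity between two noun senses
print(wn.synset("car.n.01").path_similarity(wn.synset("boat.n.01")))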

Applications of Word Similarity Using Thesaurus:


1. Synonym Replacement in Text Generation:
o Word similarity helps in automatically replacing words with synonyms during text generation or
paraphrasing, making the text more varied and fluent.
2. Information Retrieval:
o Word similarity is used to expand search queries by including synonyms, improving the recall of
relevant documents during information retrieval.
3. Word Sense Disambiguation (WSD):
o Thesaurus-based word similarity can help in resolving ambiguities in word meanings. By identifying
similar meanings of a word in context, we can choose the most appropriate sense.
4. Question Answering:
o In question answering systems, word similarity helps in retrieving answers that use synonyms of the
query terms, improving the system’s ability to handle varied phrasing.
5. Text Classification and Sentiment Analysis:
o Using synonyms to extend the feature space helps in improving the performance of text classifiers or
sentiment analysis models.

Distributional Methods in NLP


1. Introduction to Distributional Methods:
Distributional Methods are a class of techniques used in Natural Language Processing (NLP) that aim
to represent the meaning of words based on their contextual usage. The core idea is that words that
share similar contexts in text are likely to have similar meanings. This approach is rooted in
distributional semantics, a theory that suggests:
 "You shall know a word by the company it keeps."
In other words, the meaning of a word can be inferred from the words that frequently appear around it in
a given corpus of text.

2. Distributional Hypothesis:
The distributional hypothesis posits that:
 Words that occur in similar contexts have similar meanings.

 This hypothesis forms the foundation of distributional semantics and is central to distributional
methods.
Example:
 "cat" and "dog" are often used in similar contexts (e.g., "I have a pet," "The dog is barking," etc.), so
they should have similar meanings.
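A toy illustration of the hypothesis: build co-occurrence count vectors for a few words over a tiny corpus and compare them with cosine similarity. The corpus below is made up purely for illustration; real distributional models use much larger corpora and dimensionality reduction or learned embeddings:

# Toy distributional model: co-occurrence vectors + cosine similarity.
import numpy as np

corpus = [
    "the cat chased the mouse",
    "the dog chased the cat",
    "i fed my cat some food",
    "i fed my dog some food",
    "the car needs new fuel",
]

# Vocabulary and word -> index map
vocab = sorted({w for sent in corpus for w in sent.split()})
idx = {w: i for i, w in enumerate(vocab)}

# Symmetric co-occurrence counts within each sentence
cooc = np.zeros((len(vocab), len(vocab)))
for sent in corpus:
    words = sent.split()
    for i, w in enumerate(words):
        for c in words[:i] + words[i + 1:]:
            cooc[idx[w], idx[c]] += 1

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

print(cosine(cooc[idx["cat"]], cooc[idx["dog"]]))  # similar contexts -> higher score
print(cosine(cooc[idx["cat"]], cooc[idx["car"]]))  # different contexts -> lower score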

Applications of Distributional Methods:


1. Word Sense Disambiguation (WSD):
o Distributional methods help disambiguate words with multiple meanings based on their context in a
document.
2. Information Retrieval:
o These methods can improve search engines by allowing the system to retrieve documents based on the
semantic similarity of the query and the document, rather than relying on exact word matching.
3. Text Classification:
o Word embeddings or TF-IDF vectors can be used as features in machine learning models for text
classification tasks (e.g., sentiment analysis, topic categorization).
4. Machine Translation:
o Distributional methods help in aligning words across different languages by comparing the context in
which they appear.
5. Recommendation Systems:
o In recommender systems, distributional representations can be used to recommend similar products or
services based on the textual descriptions of items.

Advantages of Distributional Methods:


 Capturing Context: They are effective in capturing the context of words, which allows for better
understanding of word meaning.
 Flexibility: Distributional methods can be applied to various NLP tasks (e.g., sentiment analysis,
translation, retrieval).
 Scalability: They work well on large-scale datasets and are relatively computationally efficient once
the model (e.g., Word2Vec, GloVe) is trained.
---End of unit 4---
UNIT -05
Discourse Analysis:
Definition:
Discourse analysis (DA) is a qualitative research method that studies language in use, going beyond
individual words or sentences to examine how people communicate in different social contexts.
Focus:
It explores how language structures both texts and social interactions, aiming to understand the
meaning conveyed and the social and cultural contexts that shape communication.
Methods:
Discourse analysts examine various aspects of language, including vocabulary, grammar, gestures,
facial expressions, imagery, and language techniques, to understand how meaning is constructed and
interpreted.
Applications:
DA is used across multiple disciplines, including linguistics, sociology, media studies, history, and
more, to understand the world and how language is used in real life.

Examples:
Analyzing how people talk about a specific topic, what metaphors they use, how they take turns in
conversation, and the patterns of speech.

Lexical Resources:

Definition:
Lexical resources refer to the vocabulary and word choices available to speakers or writers, and how
they are used to express meaning.

Focus:
It examines the range and appropriateness of words and phrases used in a specific context, and how
they contribute to the overall meaning and coherence of the discourse.
Importance:
A strong lexical resource is crucial for effective communication, as it allows speakers and writers to
convey their ideas precisely and persuasively.

Examples:
Using a wide range of vocabulary, choosing appropriate words and phrases for the context, and
avoiding clichés or jargon.

Discourse segmentation:

Discourse segmentation is the process of dividing a text or conversation into meaningful segments or
units, often referred to as Elementary Discourse Units (EDUs), which are then used for further analysis
or processing in tasks like information retrieval, text summarization, and discourse parsing.
Purpose:
Discourse segmentation aims to identify the natural boundaries within a discourse, recognizing how
different parts of a text or conversation relate to each other.
Elementary Discourse Units (EDUs):
These are the basic building blocks of discourse, often considered to be clause-like units that convey a
complete thought or idea.

Applications:

 Discourse Parsing: Understanding the structure and relationships between different parts of a
discourse.
 Information Retrieval: Identifying relevant information within a text or document.
 Text Summarization: Extracting the key information and main ideas from a text.
 Question Answering: Understanding the context and relationships within a text to answer questions
accurately.

Approaches:
 Rule-based systems: Rely on predefined rules and linguistic features to identify discourse segments.
 Neural networks: Use machine learning models to learn patterns and relationships within the text.
Challenges:
 Ambiguity: Determining the boundaries between discourse units can be challenging, as they may not
always be clearly marked by punctuation or other linguistic cues.
 Context dependence: The meaning and structure of a discourse can vary depending on the context.
 Spontaneous speech: Oral conversations, which often have linguistic features foreign to written text,
can be difficult to segment.
Examples:
 "The cat sat on the mat. Then, the dog came in." - The two sentences could be considered two EDUs.
 "However, I think that..." - The word "However" can be a cue that the following sentence is a new
EDU.
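As a very rough illustration of the rule-based idea, the sketch below splits text into candidate EDUs at sentence boundaries and at a handful of discourse markers. The marker list is a made-up, minimal example; real segmenters use far richer cues or trained models:

# Naive rule-based discourse segmentation: split on sentence boundaries
# and on a few explicit discourse markers.
import re

MARKERS = r"\b(however|because|although|then|therefore)\b"

def segment(text):
    # First split into sentences, then split each sentence at a marker.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    edus = []
    for sent in sentences:
        parts = re.split(MARKERS, sent, flags=re.IGNORECASE)
        # re.split keeps the captured markers; re-attach each marker to what follows it.
        buf = parts[0].strip()
        for marker, rest in zip(parts[1::2], parts[2::2]):
            if buf:
                edus.append(buf)
            buf = (marker + " " + rest.strip()).strip()
        if buf:
            edus.append(buf)
    return edus

text = "The cat sat on the mat. Then, the dog came in because it was cold."
for edu in segment(text):
    print("-", edu)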

Coherence:

In natural language processing (NLP), coherence is a semantic property of a text that refers to the
relationship between sentences and how well they make sense together.
Why is coherence important in NLP?
 Coherence is important for tasks like machine translation and question answering, where NLP systems
need to process entire texts.
 Coherence can be used to assess the quality of writing, such as for automatic essay scoring and
readability assessment.
How is coherence achieved?
 Coherence is achieved through syntactic and semantic features.
 Syntactic features include the use of deictic, anaphoric, and cataphoric elements.
 Semantic features include presuppositions and implications that connect to general world knowledge.
 A text is coherent if it has a clear structure, with paragraphs and sentences that are organized logically.

How is coherence measured?


 Coherence can be measured by assigning a coherence score to a text.
 A text with high coherence is one that is well organized, contains relevant details, and is easy to
understand.

ANAPHORA RESOLUTION USING HOBBS AND CENTERING ALGORITHMS


What is Anaphora?
Definition:
Anaphora, in natural language processing (NLP), refers to the linguistic phenomenon where a word or
phrase (the anaphor) depends semantically on another preceding expression in the text (the antecedent).
This dependency establishes a coreference relationship, where the anaphor and antecedent refer to the
same real-world entity or concept. Understanding anaphora is crucial for tasks like text summarization,
question answering, and machine translation because it helps resolve the meaning of pronouns and other
referring expressions.
Examples:
 "The dog barked loudly. It chased its tail." Here, "It" clearly refers to "The dog," representing a simple case of anaphora resolution.
 "John met Mary. He gave her a flower." Here, "He" refers to "John" and "her" refers to "Mary," showcasing multiple anaphoric references within a single sentence.
 "The cat sat on the mat. The feline then curled up." Here, "feline" is a more complex anaphor, as it is a synonym that implies the same antecedent.
Challenges in Anaphora Resolution
Ambiguity: Pronouns can have multiple potential referents, making it difficult to identify the correct one.
Context Dependence: Resolution depends on the context, including surrounding words, sentence structure, and previous mentions.
Cross-Sentence References: Anaphoric references can span multiple sentences, requiring complex analysis.
Hobbs Algorithm for Anaphora Resolution
Core Principle:
The algorithm systematically searches for the most recent grammatical antecedent that fits the pronoun's syntactic role. This ensures that the closest and most relevant referent is selected for the pronoun.
Process:
The process involves a backward search through the sentence, carefully identifying potential referents. Syntactic and semantic constraints are then applied to narrow down the possibilities and select the most appropriate antecedent.
Example:
Consider the sentence “The cat sat on the mat. It purred.” The algorithm would correctly identify “The
cat” as the most recent and suitable antecedent for the pronoun “It”. This highlights the algorithm's
ability to efficiently resolve simple anaphora cases.
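The full Hobbs algorithm walks parse trees, but its core preference for a recent compatible antecedent can be illustrated with a tiny heuristic: pick the nearest preceding noun. This is a highly simplified stand-in written for illustration, not the actual Hobbs procedure, and the candidate noun list is assumed:

# Toy recency-based pronoun resolution (a stand-in for Hobbs' recency preference).
# It scans backwards from the pronoun and returns the nearest candidate noun.
CANDIDATE_NOUNS = {"cat", "mat", "dog", "mouse"}   # assumed noun list for this toy example
PRONOUNS = {"it", "he", "she", "they"}

def resolve(tokens, pronoun_index):
    for i in range(pronoun_index - 1, -1, -1):      # walk backwards (recency)
        word = tokens[i].lower().strip(".,")
        if word in CANDIDATE_NOUNS:
            return tokens[i]
    return None

tokens = "The cat sat on the mat . It purred".split()
print(resolve(tokens, tokens.index("It")))   # prints "mat", not "cat": pure recency is
                                             # not enough, which is why Hobbs also applies
                                             # syntactic and semantic constraints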
Centering Theory for Anaphora Resolution
Central Concept:
Focuses on shifting discourse focus, tracking important entities in a sentence using "centers".
Focus Shift:
Pronoun resolution is guided by how focus shifts between sentences.
Example:
"The cat chased the mouse. It ran away." The mouse is the focus in the second sentence, making it the referent for "It".
Integrating Hobbs and Centering Algorithms
1. Combined Approach: Leveraging both algorithms' strengths to achieve more accurate and robust anaphora resolution.
2. Hybrid Model: The Hobbs algorithm provides syntactic constraints, while centering theory captures discourse focus shifts.
3. Improved Accuracy: This integration can address complex cases involving ambiguity and cross-sentence references.
Implementation Considerations
1. Data Preparation: Cleaning and annotating training data is crucial for model accuracy.
2. Model Selection: Choosing a suitable NLP model that supports anaphora resolution capabilities.
3. Optimization: Fine-tuning hyperparameters and evaluating performance using appropriate metrics.
Evaluation Metrics and Benchmarks
Precision:
The proportion of correct anaphora resolutions among all predicted resolutions.
Recall:
The proportion of correctly resolved anaphoric references among all actual references.

F1-Score:
The harmonic mean of precision and recall, providing a balanced evaluation metric.
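These metrics can be computed directly with scikit-learn once gold and predicted resolutions are encoded as labels. A small sketch with made-up labels, shown only to make the formulas concrete:

# Computing precision, recall, and F1 for resolution decisions with scikit-learn.
from sklearn.metrics import precision_score, recall_score, f1_score

# 1 = anaphor resolved to the correct antecedent, 0 = incorrect (toy gold/predicted labels)
gold = [1, 1, 0, 1, 0, 1]
pred = [1, 0, 0, 1, 1, 1]

print("Precision:", precision_score(gold, pred))
print("Recall:   ", recall_score(gold, pred))
print("F1-score: ", f1_score(gold, pred))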
Real-World Applications and Future Directions
Chatbots
Understanding conversational context to provide coherent and accurate responses.
Search Engines
Improving search results by correctly identifying the entities referred to in user queries.
Text Summarization
Generating accurate and coherent summaries by resolving anaphoric references to the correct entities.

Coreference Resolution in NLP


Introduction to Coreference Resolution
Coreference resolution is the task of determining when two or more expressions in a text refer to the same entity.
Importance of Coreference Resolution
- Essential for NLP tasks like machine translation, summarization, and QA.
- Helps in understanding relationships between entities in text.
- Improves contextual understanding in AI models.
Types of Coreference
- Pronominal Coreference: Resolving pronouns to their antecedents.
- Anaphoric Reference: Linking back to earlier mentions.
- Bridging Reference: Implicit relations between entities.

Approaches to Coreference Resolution

- Rule-based methods: Hand-crafted rules.
- Machine Learning: Supervised classifiers.
- Deep Learning: Transformer-based models like BERT and SpanBERT.

Challenges in Coreference Resolution

- Ambiguity in pronouns.
- Long-distance dependencies.
- Lack of labeled datasets.
- Complexity in multi-entity contexts.

Applications of Coreference Resolution

- Machine translation
- Chatbots and virtual assistants
- Information extraction
- Text summarization
- Sentiment analysis

Tools and Libraries for Coreference Resolution

- Stanford CoreNLP
- spaCy NeuralCoref
- AllenNLP Coreference Model
- Hugging Face Transformers

Conclusion
Coreference resolution plays a crucial role in NLP by enhancing text comprehension, improving AI applications, and enabling better contextual analysis.
FRAMENET:
The FrameNet corpus is a lexical database of English that is both human- and machine-readable, based on annotating examples of how words are used in actual texts.
FrameNet represents relationships between words.
FrameNet is based on a theory of meaning called Frame Semantics, deriving from the work of Charles J. Fillmore and colleagues.
For example, the concept of cooking typically involves a person doing the cooking (Cook), the food that is to be cooked (Food), something to hold the food while cooking (Container) and a source of heat (Heating instrument).
APPLICATIONS:
Question answering, paraphrasing, textual entailment and information extraction are some of the applications of FrameNet.
Example:
"John gave Mary a book". In this frame, John is the "giver", Mary is the "recipient" and the book is the "thing transferred".
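FrameNet can also be browsed programmatically through NLTK once the data is downloaded (nltk.download('framenet_v17')). The sketch below assumes the frame name "Giving" and NLTK's attribute names; exact frame and frame-element names depend on the FrameNet release:

# Browsing FrameNet with NLTK (requires: nltk.download('framenet_v17')).
from nltk.corpus import framenet as fn

frame = fn.frame("Giving")               # the frame evoked by verbs like "give"
print(frame.name)
print(sorted(frame.FE.keys()))           # frame elements such as Donor, Recipient, Theme
print(list(frame.lexUnit)[:5])           # some lexical units that evoke this frame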
WORDNET:
WordNet is a large lexical database of English words. It groups words into sets of synonyms, which represent a concept or meaning. It is a lexical database, i.e. a dictionary for the English language, specifically designed for natural language processing.
Synset is a simple interface provided in NLTK to look up words in WordNet. Synset instances are groupings of synonymous words that express the same concept. Some words have only one synset and some have several.
Applications:
Word sense disambiguation, text classification, question answering, and natural language generation are some of the applications of WordNet.
PROPBANK:
PropBank is a corpus of text annotated with semantic roles, which are the roles played by entities in a sentence. PropBank (Proposition Bank) is a digital collection of parsed sentences – a treebank – based on the Penn Treebank, with other treebanks for languages other than English. The sentences are parsed and annotated with the semantic roles described in VerbNet, another major resource. Each sentence in PropBank is linked to a list of its numbered arguments, each with a semantic (thematic) role and selection restrictions, and labeled with its VerbNet class. PropBank was made primarily for the training of automatic semantic role labelers through machine learning.
PropBank Versions and Interfaces:
1. PropBank 1.0 – the original version of PropBank; it contains 30,000 sentences.
2. PropBank 2.0 – an updated version of PropBank which includes annotations of 60,000 sentences.
Brown Corpus:
It is a compiled selection of American English text drawn from 500 sources, containing about one million words of American English. It is used for research on language patterns and for training NLP models. The Brown Corpus contains roughly 52,000 sentences.
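The Brown Corpus ships with NLTK, so its size and category structure can be inspected directly (after nltk.download('brown')). A short sketch:

# Exploring the Brown Corpus with NLTK (requires: nltk.download('brown')).
from nltk.corpus import brown

print(len(brown.words()))        # total number of word tokens (about one million)
print(len(brown.sents()))        # total number of sentences
print(brown.categories())        # genre categories (news, fiction, ...)
print(brown.words(categories="news")[:10])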
British National Corpus:


The British National Corpus (BNC) is a 100-million-word text corpus of samples of written and spoken English compiled from a wide range of sources, covering various genres, domains and time periods. The corpus covers British English of the late 20th century, with the intention that it be a representative sample of spoken and written British English of that time. It is used in corpus linguistics for the analysis of corpora.
Applications:
1. Language modeling
2. Part of speech tagging
3. Named entity recognition
4. Sentiment analysis
5. Text classification
