Natural Language
Processing
CSA4006
Dr. Anirban Bhowmick
Assistant Professor
VIT Bhopal
Lecture : 1
Syllabus
Module 1:
Introduction: Knowledge in Speech and Language Processing - Ambiguity - Models and Algorithms - Language, Thought, and Understanding - The State of the Art and the Near-Term Future - Regular Expressions - Basic Regular Expression Patterns - Disjunction, Grouping, and Precedence - Using an FSA to Recognize Sheeptalk - Formal Languages.
Text Books:
Daniel Jurafsky and James H. Martin, "Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics and Speech Recognition", Prentice Hall, 2nd edition, 2008.
Reference Books:
1. Roland R. Hausser, "Foundations of Computational Linguistics: Human-Computer Communication in Natural Language", Paperback, MIT Press, 2011.
2. Christopher D. Manning and Hinrich Schuetze, "Foundations of Statistical Natural Language Processing", MIT Press.
Module 1:
Introduction
Topic: Introduction
NLP
Natural Language Processing (NLP) is a subfield of artificial intelligence (AI) that focuses on the interaction
between computers and humans through natural language. NLP enables computers to understand,
interpret, and generate human language in a way that is both meaningful and valuable.
Why Study NLP?
Ubiquity of Language: Language is a fundamental
medium of communication among humans. NLP allows
machines to understand and process this
communication, enabling a wide range of applications.
Real-World Applications: NLP is used in various real-world applications such as virtual assistants, sentiment analysis, language translation, information retrieval, and more.
Data Explosion: The digital age has led to an
explosion of textual data. NLP provides the tools to
extract insights and information from this data.
Brief History of NLP
Early Foundations (1950s-1970s)
1950s: The field of AI is born, and early attempts
at machine translation (MT) using rule-based
systems.
1960s: ELIZA, a computer program capable of simulating human conversation, is developed by Joseph Weizenbaum.
1970s: Rule-based approaches dominate NLP, but they struggle with the complexity and ambiguity of language.
Note: ELIZA simulated conversation by using a pattern matching
and substitution methodology that gave users an illusion of
understanding on the part of the program
Brief History of NLP
Statistical NLP (1980s-2000s)
1980s: Introduction of statistical methods,
Hidden Markov Models (HMMs), and
probabilistic context-free grammars.
1990s: The use of large corpora and the development of the Penn Treebank revolutionize NLP. Introduction of part-of-speech tagging and syntactic parsing.
2000s: More sophisticated statistical models like
Conditional Random Fields (CRFs) and word
embeddings (Word2Vec, GloVe) emerge. Shift
towards data-driven approaches.
Brief History of NLP
Deep Learning and Modern NLP (2010s-Present)
2010s: Deep Learning redefines NLP with neural network
architectures like Recurrent Neural Networks (RNNs) and
Convolutional Neural Networks (CNNs).
2013: Introduction of Word2Vec by Mikolov et al., which learns word embeddings from large text corpora.
2014: "Sequence to Sequence" models enable breakthroughs in
machine translation.
2018: Transformers, exemplified by the BERT model, revolutionize
NLP tasks by learning contextualized word representations.
Present: State-of-the-art models like GPT-3.5 achieve remarkable
performance across a wide range of NLP tasks using massive
amounts of data and computation.
NLP-Rule based
Rule-based Natural Language Processing (NLP) is an approach to language processing that relies on a
set of predefined rules and patterns to analyze and extract information from text data. It contrasts with
machine learning-based NLP, which uses algorithms and models to learn patterns and make predictions
from data.
Rule: If a text contains a date in the format "dd/mm/yyyy" or "dd-mm-yyyy," extract it.
Example Text: "The project deadline is 25/09/2023,
and the meeting is scheduled for 30-09-2023."
Rule-Based NLP Output:
Extracted Date: "25/09/2023"
Extracted Date: "30-09-2023"
NLP- Statistical model based
Statistical model-based Natural Language Processing (NLP) relies on the use of statistical techniques
and machine learning algorithms to analyze and understand text data. Unlike rule-based NLP, which relies
on predefined rules and patterns, statistical model-based NLP learns patterns and relationships from data.
Task: Text Classification
Statistical Model: Support Vector Machine (SVM)
Example: Sentiment Analysis
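A minimal sketch of SVM-based sentiment analysis, assuming scikit-learn is available; the tiny training set below is invented purely for illustration:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC

train_texts = ["I loved this movie", "great acting and plot",
               "terrible, a waste of time", "I hated every minute"]
train_labels = ["pos", "pos", "neg", "neg"]

vectorizer = CountVectorizer()            # bag-of-words features
X_train = vectorizer.fit_transform(train_texts)
clf = LinearSVC()                         # linear SVM classifier
clf.fit(X_train, train_labels)

X_test = vectorizer.transform(["what a great film"])
print(clf.predict(X_test))                # e.g. ['pos']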
NLP-Penn Treebank based
The Penn Treebank is a widely used dataset in Natural Language Processing (NLP) that provides
annotated syntactic and structural information for English text. It uses a tree structure to represent the
grammatical and syntactic relationships within sentences. One common application of Penn Treebank-
based NLP is parsing sentences to analyze their grammatical structure.
Task: Sentence Parsing
"The quick brown fox jumps over the lazy dog."
Tokenization: The sentence is first tokenized into individual words and punctuation marks. In this case, the sentence is tokenized as follows:
["The", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog", "."]
Part-of-Speech (POS) Tagging: Each token is assigned a POS tag that represents its grammatical category (e.g., noun, verb, adjective). Here is the example sentence with POS tags:
[("The", "DT"), ("quick", "JJ"), ("brown", "JJ"), ("fox", "NN"), ("jumps", "VBZ"), ("over", "IN"), ("the", "DT"), ("lazy", "JJ"), ("dog", "NN"), (".", ".")]
NLP-Penn Treebank based
Parsing: The Penn Treebank-based NLP system uses syntactic rules and information to parse the
sentence into a tree structure that represents its grammatical and syntactic relationships. The resulting
parse tree for the example sentence might look like this:
(S
  (NP (DT The) (JJ quick) (JJ brown) (NN fox))
  (VP (VBZ jumps) (PP (IN over) (NP (DT the) (JJ lazy) (NN dog))))
  (. .))
In this parse tree, "S" represents the sentence, "NP" represents a noun phrase, "VP" represents a verb
phrase, "DT" represents a determiner, "JJ" represents an adjective, "NN" represents a noun, "VBZ"
represents a verb, and "IN" represents a preposition. The tree structure captures the hierarchical
relationships between the words in the sentence.
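The bracketed Penn Treebank notation above can be loaded and inspected programmatically, for example with NLTK's Tree class (assuming NLTK is installed):

from nltk import Tree

parse = Tree.fromstring(
    "(S (NP (DT The) (JJ quick) (JJ brown) (NN fox)) "
    "(VP (VBZ jumps) (PP (IN over) (NP (DT the) (JJ lazy) (NN dog)))) (. .))")
print(parse.label())      # S
print(parse.leaves())     # the words of the sentence, in order
parse.pretty_print()      # renders the tree as ASCII art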
NLP-CRFs
Conditional Random Fields (CRFs) are a popular machine learning model used in Natural Language
Processing (NLP) for sequence labeling tasks, such as named entity recognition (NER), part-of-speech
tagging (POS), and chunking. CRFs are particularly effective at capturing dependencies between adjacent
labels in a sequence.
Example Sentence:
"Apple Inc. is headquartered in Cupertino, California."
Tokens: ["Apple", "Inc.", "is", "headquartered", "in", "Cupertino", ",", "California", "."]
Label Sequence (NER Tags): ["B-ORG", "I-ORG", "O", "O", "O", "B-LOC", "I-LOC", "I-LOC", "O"]
In this example, the labels indicate the following:
"B-ORG": Beginning of an organization name.
"I-ORG": Inside an organization name.
"B-LOC": Beginning of a location name.
"I-LOC": Inside a location name.
"O": Represents words that are not part of any named entity
NLP-State of art
Word2Vec, Sequence-to-Sequence (Seq2Seq), and Transformers are all important techniques in
Natural Language Processing (NLP), but they serve different purposes and have different characteristics.
Let's compare them based on several key aspects:
Objective: Word2Vec is used for word embedding; Seq2Seq models are designed for sequence-to-sequence tasks such as machine translation (MT) and text summarization (TS); Transformers were initially designed for seq2seq but have become fundamental to NLP.
Model Architecture: Word2Vec is a shallow neural network (e.g., CBOW); Seq2Seq uses an encoder and a decoder built from RNNs or LSTMs; Transformers use a self-attention mechanism and feed-forward neural networks (FFNN).
Training: Word2Vec is trained on a large corpus; Seq2Seq on parallel input and target sequences; Transformers on massive corpora with self-supervised pretraining followed by fine-tuning.
Parallelism: Word2Vec is inherently parallelizable; Seq2Seq is less parallelizable; Transformers are highly parallelizable.
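As a concrete anchor for the Word2Vec column, here is a minimal embedding-training sketch, assuming gensim 4.x (the vector_size parameter was named size in older gensim releases); the toy corpus is invented:

from gensim.models import Word2Vec

corpus = [["the", "king", "rules", "the", "kingdom"],
          ["the", "queen", "rules", "the", "kingdom"],
          ["dogs", "bark"], ["cats", "meow"]]

model = Word2Vec(corpus, vector_size=50, window=2, min_count=1, sg=0)  # sg=0: CBOW
print(model.wv["king"].shape)         # (50,)
print(model.wv.most_similar("king"))  # nearest neighbours in embedding space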
Natural Language
Processing
CSA4006
Dr. Anirban Bhowmick
Assistant Professor
VIT Bhopal
Lecture : 2
Module 1:
Introduction
Topic: Introduction
Applications of NLP
Communication With Machines
Applications of NLP
Conversational Agents: Building AI systems that can engage in natural-sounding conversations with users. Used in customer support, virtual companions, and mental health apps. Conversational agents contain:
● Speech recognition
● Language analysis
● Dialogue processing
● Information retrieval
● Text to speech
Question Answering: Developing systems that can understand and answer questions posed in natural language. Used in chatbots, virtual assistants, and information retrieval.
Text Generation: Creating human-like text using models like OpenAI's GPT-3. Applications range from creative writing to chatbots.
Applications of NLP
Machine Translation: Automatically translating text from one language to another. Google Translate and other translation services heavily rely on NLP techniques.
Sentiment Analysis: Analyzing text to determine the sentiment (positive, negative, neutral) expressed by the author. Applications include brand monitoring, customer feedback analysis, and social media sentiment tracking.
Information Retrieval: Improving search engines by understanding user queries and retrieving relevant information from a large dataset.
Named Entity Recognition (NER): Identifying entities like names, dates, locations, and more within a text. Used in information extraction, chatbots, and language translation.
Level Of Linguistic Knowledge
1. Phonetics and Phonology
At this level, NLP systems consider the sounds of speech. It involves understanding the
phonemes (distinct speech sounds) and the rules governing their pronunciation, as well as the
intonation patterns and stress in spoken language.
2. Morphology
Morphology deals with the internal structure of words and how they are formed from smaller units called morphemes. Morphological analysis helps in tasks like stemming (reducing words to their base form) and lemmatization (reducing words to their dictionary form).
3. Syntax
Syntax involves the rules governing the structure of sentences. It includes understanding how
words combine to form phrases and sentences, and the relationships between different parts of
speech. Parsing techniques are used to analyze sentence structure.
Level Of Linguistic Knowledge
4. Semantics
Semantics is the study of meaning in language. NLP systems at this level aim to understand the
meaning of individual words, phrases, and sentences. This can involve tasks like word sense
disambiguation (determining the correct meaning of a word based on context) and semantic role
labeling (identifying the roles of words in a sentence, e.g., subject, object).
5. Pragmatics
Pragmatics refers to the use of language in context. It involves understanding implied meaning,
indirect speech acts, and the intentions behind statements. This level is crucial for understanding
sarcasm, irony, and other forms of figurative language.
6. Discourse
Discourse refers to the structure and organization of connected text or speech. NLP systems at this level
consider how sentences relate to each other and form coherent paragraphs or dialogues. Coreference
resolution (identifying which words refer to the same entity) is an important task in discourse analysis.
Why NLP is Hard?
1. Ambiguity
2. Scale
3. Sparsity
4. Variation
5. Expressivity
6. Unmodeled Variables
7. Unknown representations
Ambiguity
Ambiguity at multiple levels
Word senses: bank (finance or river ?)
Part of speech: chair (noun or verb ?)
Syntactic structure: I can see a man with a telescope
Multiple: I made her duck
Semantic: Time flies like an arrow; fruit flies like a banana
Phonological: "I scream, you scream, we all scream for ice cream." (The words "I scream" and "ice cream" sound alike.)
Ambiguity
These different meanings are caused by a number of ambiguities. First, the words duck and her are morphologically or syntactically ambiguous in their part-of-speech: duck can be a verb or a noun, while her can be a dative pronoun or a possessive pronoun. Second, the word make is semantically ambiguous; it can mean create or cook. Third, the verb make is syntactically ambiguous in a different way: make can be transitive, that is, taking a single direct object, or it can be ditransitive, that is, taking two objects, meaning that the first object (her) was made into the second object (duck). Finally, make can take a direct object and a verb, meaning that the object (her) was caused to perform the verbal action (duck). Furthermore, in a spoken sentence there is an even deeper kind of ambiguity; the first word could have been eye or the second word maid.
Ambiguity
We often introduce the models and
algorithms we present throughout the book
as ways to resolve or disambiguate these
ambiguities. For example, deciding whether
duck is a verb or a noun can be solved by
part-of-speech tagging. Deciding whether 16
make means “create” or “cook” can be
solved by word sense disambiguation.
Resolution of part-of-speech and word
sense ambiguities are two important kinds of
lexical disambiguation
Note: Word Sense Disambiguation (WSD) is a natural language
processing (NLP) task that focuses on determining the correct meaning or
sense of a word in a given context.
Scale
Scale in NLP refers to the challenges and opportunities posed by the vast amounts of linguistic data
available for analysis. The scale of data in NLP presents both technical and computational challenges,
but it also enables the development of more sophisticated models and applications.
Challenges of Scale
Data Collection: Gathering and annotating large-scale linguistic data is resource-intensive and time-consuming.
Computational Resources: Processing and analyzing massive datasets require significant
computational power and memory.
Model Complexity: More data often leads to larger and more complex models, which may require
specialized hardware and efficient training techniques.
Noise and Quality: As datasets grow, ensuring data quality becomes crucial, as noise can negatively
impact model performance.
Scale
Opportunities of Scale
Improved Models: Large datasets enable the training
of more accurate and robust NLP models that can
capture subtle linguistic nuances.
Generalization: Models trained on extensive data have the potential to generalize better across various domains and languages.
Transfer Learning: Pretrained models on massive
datasets can be fine-tuned for specific tasks, reducing
the need for extensive task-specific data.
Multilingualism: Large-scale data allows models to
learn from multiple languages, enabling multilingual
applications.
Sparsity
Sparsity is a common challenge in Natural Language Processing (NLP) that arises due to the vast and
diverse nature of human language. In NLP, sparsity refers to the phenomenon where the data space is
extremely large, but the actual data available for any specific point in that space is very limited. This can
have significant implications for various NLP tasks and models.
Causes of Sparsity in NLP
Vocabulary Size: Natural languages have extensive vocabularies with numerous words, many of which are rare or domain-specific. The majority of words appear infrequently in any given text corpus.
Long Tail Distribution: The frequency distribution of words follows a "long tail" pattern, where a few
common words appear frequently, while the majority of words occur rarely.
Named Entities: Entities like names, locations, dates, and specialized terms are sparse in most text data.
Word Combinations: The number of possible word combinations is astronomically large, but most of these
combinations are never observed in real-world text
Natural Language
Processing
CSA4006
Dr. Anirban Bhowmick
Assistant Professor
VIT Bhopal
Lecture : 3
Module 1:
Introduction
Topic: Regular Expression
Variation
Suppose we train a part of speech tagger or a parser on the Wall Street Journal
What will happen if we try to use this tagger/parser for social media?
“ikr smh he asked fir yo last name so he can add u on fb lololol”
POS Tagging
Expressivity
Not only can one form have different meanings (ambiguity) but the same meaning can be expressed with
different forms:
Unmodeled Variables
World knowledge
I dropped the glass on the floor and it broke
I dropped the hammer on the glass and it broke
Unmodeled Representation
Unmodeled representations in NLP refer to aspects of language and meaning that are not fully
captured by existing language models, resulting in situations where models struggle to understand the
nuances and complexities of human communication. Here are some examples of unmodeled
representations:
Example: "She's as busy as a bee." Example: "He's the Einstein of our group."
In this metaphor, the phrase "busy as a bee" implies This expression assumes knowledge about 14
that she is very industrious, but this meaning is not who Einstein was and what he symbolizes.
directly related to bees being busy insects. A model lacking this cultural context might
miss the intended comparison.
Example: "Oh great, another flat tire!"
This statement might be used in a situation where
someone is frustrated about a recurring problem, and
the words imply sarcasm despite the literal words
expressing annoyance.
Factors Changing NLP Landscape
1. Increases in computing power
2. The rise of the web, then the social web
3. Advances in machine learning
4. Advances in understanding of language in social
context
Regular Expressions
Regular expressions (regex) are powerful tools used in Natural Language Processing
(NLP) to match and manipulate text patterns. They provide a concise and flexible way to
search, extract, and manipulate textual data.
Imagine you needed to search a string for a term, such as "phone":
“phone” in “Is the phone here?”
>>> True
Imagine you needed to search for a phone number, "91-98765-43210"; we can do the same:
“91-98765-43210” in “Her phone number is 91-98765-43210”
>>> True
Regular Expression
But what if you don't know the exact number, or you need to find all the phone numbers in the text? We need to use regular expressions to search through the document for this pattern.
Regular expressions allow for pattern searching in a text document.
r’\d{2}-\d{5}-\d{5}’
\d is the placeholder pattern code for a single digit
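Putting the pattern to work (a small self-contained demo; the second number is invented for illustration):

import re

text = "Call 91-98765-43210 or 91-12345-67890 for details."
print(re.findall(r"\d{2}-\d{5}-\d{5}", text))
# ['91-98765-43210', '91-12345-67890']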
Regular Expressions: Disjunctions
Letters inside square brackets []
Ranges [A-Z]
CSA4006-Dr. Anirban Bhowmick
Regular Expressions: Negation in
Disjunction
Negations [^Ss]: the caret means negation only when it is first inside []
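A quick demo of disjunction, ranges, and negation (the sample strings are illustrative):

import re

print(re.findall(r"[wW]oodchuck", "Woodchuck or woodchuck"))  # either case
print(re.findall(r"[A-Z]", "Drenched Blossoms"))              # ['D', 'B']
print(re.findall(r"[^Ss]", "Ssss!"))                          # ['!'] - anything but S or s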
Regular Expression
Pattern: colou?r (optional previous char) matches: color, colour
Pattern: oo*h! (0 or more of previous char) matches: oh!, ooh!, oooh!, ooooh!
Pattern: o+h! (1 or more of previous char) matches: oh!, ooh!, oooh!, ooooh!
Pattern: baa+ matches: baa, baaa, baaaa, baaaaa
Pattern: beg.n (any character between beg and n) matches: begin, begun, beg3n
Regular Expressions: Anchors ^ $
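The anchors match positions rather than characters: ^ matches the start of a line (or string) and $ matches the end. A small illustrative demo:

import re

line = "The dog chased the cat"
print(re.findall(r"^The", line))   # ['The'] - only at the start
print(re.findall(r"cat$", line))   # ['cat'] - only at the end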
Advanced Operators
A range of numbers can also be specified: /{n,m}/ specifies from n to m occurrences of the previous char or expression, while /{n,}/ means at least n occurrences of the previous expression.
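For example (illustrative strings):

import re

print(re.findall(r"ba{2,3}!", "ba! baa! baaa! baaaa!"))  # ['baa!', 'baaa!']
print(re.findall(r"ba{2,}!", "ba! baa! baaaa!"))         # ['baa!', 'baaaa!']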
Error
Find me all instances of the word “the” in a text.
the ----- misses capitalized examples (The)
[tT]he ----- incorrectly returns other or theology
[^a-zA-Z][tT]he[^a-zA-Z] ----- matches the correct one
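The three attempts compared on a sample sentence (illustrative):

import re

text = "The other one, then the theology one."
print(re.findall(r"the", text))       # hits 'other', 'then', 'theology'; misses 'The'
print(re.findall(r"[tT]he", text))    # now catches 'The', but still the false hits
print(re.findall(r"[^a-zA-Z][tT]he[^a-zA-Z]", text))
# [' the '] - only the standalone word (it would still miss a line-initial 'The')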
The process we just went through was based on fixing two kinds of errors:
Matching strings that we should not have matched (there, then, other): false positives (Type I)
Not matching things that we should have matched (The): false negatives (Type II)
In NLP we are always dealing with these kinds of errors. Reducing the error rate for an application often involves two antagonistic efforts:
Increasing accuracy or precision (minimizing false positives)
Increasing coverage or recall (minimizing false negatives)
Regular Expressions
Lecture 11b
Larry Ruzzo
Outline
• Some string tidbits
• Regular expressions and pattern matching
Strings Again
'abc', "abc", '''abc''', and r'abc'
all denote the same three characters: a b c
Strings Again
'abc\n', "abc\n", and
'''abc
'''
all denote four characters: a b c newline
r'abc\n' denotes five characters: a b c \ n
Why so many?
' vs " lets you put the other kind inside
’’’ lets you run across many lines
all 3 let you show “invisible” characters (via \n, \t, etc.)
r’...’ (raw strings) can’t do invisible stuff, but avoid problems
with backslash
open('C:\new\text.dat') vs
open('C:\\new\\text.dat') vs
open(r'C:\new\text.dat')
RegExprs are
Widespread
• shell file name patterns (limited)
• unix utility “grep” and relatives
• try “man grep” in terminal window
• perl
• TextWrangler
• Python
Patterns in Text
• Pattern-matching is frequently useful
• Identifier: A letter followed by >= 0 letters or digits.
count1 number2go, not 4runner
• TATA box: TATxyT where x or y is A
TATAAT TATAgT TATcAT, not TATCCT
• Number: >=1 digit, optional decimal point, exponent.
3.14 6.02E+23, not 127.0.0.1
Regular Expressions
• A language for simple patterns, based on 4 simple
primitives
• match single letters
• this OR that
• this FOLLOWED BY that
• this REPEATED 0 or more times
• A specific syntax (fussy, and varies among pgms...)
• A library of utilities to deal with them
• Key features: Search, replace, dissect
Regular Expressions
• Do you absolutely need them in Python?
• No, everything they do, you could do yourself
• BUT pattern-matching is widely needed,
tedious and error-prone. RegExprs give you a
flexible, systematic, compact, automatic way to
do it. A common language for specifications.
• In truth, it’s still somewhat error-prone, but in
a different way.
Examples
(details later)
• Identifier: letter followed by ≥0 letters or digits.
[a-z][a-z0-9]* i count1 number2go
• TATA box: TATxyT where x or y is A
TAT(A.|.A)T TATAAT TATAgT TATcAT
• Number: one or more digits with optional
decimal point, exponent.
\d+\.?\d*(E[+-]?\d+)? 3.14 6.02E+23
Another Example
Repressed binding sites in regular Python
# assume we have a genome sequence in string variable myDNA
for index in range(0, len(myDNA)-20):
    if ((myDNA[index] == "A" or myDNA[index] == "G") and
        (myDNA[index+1] == "A" or myDNA[index+1] == "G") and
        (myDNA[index+2] == "A" or myDNA[index+2] == "G") and
        (myDNA[index+3] == "C") and
        (myDNA[index+4] == "C") and
        # ... and on and on!
        (myDNA[index+19] == "C" or myDNA[index+19] == "T")):
        print "Match found at ", index
        break
Example
re.findall(r"[AG]{3,3}CATG[TC]{4,4}[AG]{2,2}C[AT]TG[CT][CG][TC]", myDNA)
RegExprs in Python
http://docs.python.org/library/re.html
Simple RegExpr Testing
>>> import re
>>> str1 = 'what foot or hand fell fastest'
>>> re.findall(r'f[a-z]*', str1)
['foot', 'fell', 'fastest']
>>> str2 = "I lack e's successor"
>>> re.findall(r'f[a-z]*', str2)
[]
Definitely recommend trying this with examples to follow, & more
Returns list of all matching substrings.
Exercise: change it to find strings
starting with f and ending with t
Exercise: In honor of the
winter Olympics, “-ski-ing”
• download & save war_and_peace.txt
• write py program to read it line-by-line, use re.findall to see whether current line contains one or more proper names ending in "...ski"; print each.
• mine begins:
['Bolkonski']
['Bolkonski']
['Bolkonski']
['Bolkonski']
['Bolkonski']
['Razumovski']
['Razumovski']
['Bolkonski']
['Spasski']
...
['Nesvitski', 'Nesvitski']
RegExpr Syntax
They're strings
Most punctuation is special; it needs to be escaped by backslash (e.g., "\." instead of ".") to get non-special behavior
So, "raw" string literals (r'C:\new\.txt') are generally recommended for regexprs, unless you double your backslashes judiciously
Patterns “Match” Text
Pattern: TAT(A.|.A)T matches in text: RATATaAT TAT!
Pattern: [a-z][a-z0-9]* matches in text: count1
RegExpr Semantics, 1
Characters
RegExprs are patterns; they "match" sequences
of characters
Letters, digits (& escaped punctuation like ‘\.’)
match only themselves, just once
r'TATAAT' matches in: 'ACGTTATAATGGTATAAT'
RegExpr Semantics, 2
Character Groups
Character groups [abc], [a-zA-Z], [^0-9] also
match single characters, any of the characters
in the group.
Shortcuts (2 of many):
. (just a dot) matches any character (except newline)
\s ≡ [ \n\t\r\f\v] ("s" for "space")
r'T[AG]T[^GC].T' matches in: 'ACGTTGTAATGGTATnCT'
Matching one of several alternatives
• Square brackets mean that any of the listed characters will do
• [ab] means either ”a” or ”b”
• You can also give a range:
• [a-d] means ”a” ”b” ”c” or ”d”
• Negation: caret means ”not”
[^a-d] # anything but a, b, c or d
RegExpr Semantics, 3:
Concatenation, Or, Grouping
You can group subexpressions with parens
If R, S are RegExprs, then
RS matches the concatenation of strings matched
by R, S individually
R | S matches the union–either R or S
r'TAT(A.|.A)T' matches in: 'TATCATGTATACTCCTATCCT' (where?)
RegExpr Semantics, 4
Repetition
If R is a RegExpr, then
R* matches 0 or more consecutive strings
(independently) matching R
R+ 1 or more
R{n} exactly n
R{m,n} any number between m and n, inclusive
R? 0 or 1
Beware precedence (* > concat > |)
r'TAT(A.|.A)*T' matches in: 'TATCATGTATACTATCACTATT' (where?)
RegExprs in Python
By default
Case sensitive, line-oriented (\n treated specially)
Matching is generally “greedy”
Finds longest version of earliest starting match
Next “findall()” match will not overlap
r".+\.py" "Two files: hw3.py and upper.py."
r"\w+\.py" "Two files: hw3.py and UPPER.py."
Exercise 3
Suppose “filenames” are upper or lower case
letters or digits, starting with a letter, followed
by a period (“.”) followed by a 3 character
extension (again alphanumeric). Scan a list of
lines or a file, and print all “filenames” in it,
without their extensions. Hint: use paren
groups.
Solution 3
import sys
import re
filename = sys.argv[1]
filehandle = open(filename,"r")
filecontents = filehandle.read()
myrule = re.compile(
r"([a-zA-Z][a-zA-Z0-9]*)\.[a-zA-Z0-9]{3}")
#Finds skidoo.bar amidst 23skidoo.barber; ok?
match = myrule.findall(filecontents)
print match
Basics of regexp construction
• Letters and numbers match themselves
• Normally case sensitive
• Watch out for punctuation–most of it has special meanings!
Wild cards
• ”.” means ”any character”
• If you really mean ”.” you must use a backslash
• WARNING:
– backslash is special in Python strings
– It’s special again in regexps
– This means you need too many backslashes
– We will use ”raw strings” instead
– Raw strings look like r"ATCGGC"
Using . and backslash
• To match file names like ”hw3.pdf” and ”hw5.txt”:
hw.\....
Zero or more copies
• The asterisk repeats the previous character 0 or more times
• ”ca*t” matches ”ct”, ”cat”, ”caat”, ”caaat” etc.
• The plus sign repeats the previous character 1 or more times
• ”ca+t” matches ”cat”, ”caat” etc. but not ”ct”
Repeats
• Braces are a more detailed way to indicate repeats
• A{1,3} means at least one and no more than three A’s
• A{4,4} means exactly four A’s
simple testing
>>> import re
>>> string = 'what foot or hand fell fastest'
>>> re.findall(r'f[a-z]*', string)
['foot', 'fell', 'fastest']
Practice problem 1
• Write a regexp that will match any string that starts with ”hum” and
ends with ”001” with any number of characters, including none, in
between
• (Hint: consider both ”.” and ”*”)
Practice problem 2
• Write a regexp that will match any Python (.py) file.
• There must be at least one character before the ”.”
• ”.py” is not a legal Python file name
• (Imagine the problems if you imported it!)
Using the regexp
First, compile it:
import re
myrule = re.compile(r".+\.py")
print myrule
<_sre.SRE_Pattern object at 0xb7e3e5c0>
The result of compile is a Pattern object which represents your regexp
Using the regexp
Next, use it:
mymatch = myrule.search(myDNA)
print mymatch
None
mymatch = myrule.search(someotherDNA)
print mymatch
<_sre.SRE_Match object at 0xb7df9170>
The result of match is a Match object which represents the result.
All of these objects! What can they do?
Functions offered by a Pattern object:
• match()–does it match the beginning of my string? Returns None or a
match object
• search()–does it match anywhere in my string? Returns None or a
match object
• findall()–does it match anywhere in my string? Returns a list of
strings (or an empty list)
• Note that findall() does NOT return a Match object!
All of these objects! What can they do?
Functions offered by a Match object:
• group()–return the string that matched
group()–the whole string
group(1)–the substring matching 1st parenthesized sub-pattern
group(1,3)–tuple of substrings matching 1st and 3rd parenthesized
sub-patterns
• start()–return the starting position of the match
• end()–return the ending position of the match
• span()–return (start,end) as a tuple
A practical example
Does this string contain a legal Python filename?
import re
myrule = re.compile(r".+\.py")
mystring = "This contains two files, hw3.py and uppercase.py."
mymatch = myrule.search(mystring)
print mymatch.group()
This contains two files, hw3.py and uppercase.py
# not what I expected! Why?
Matching is greedy
• My regexp matches ”hw3.py”
• Unfortunately it also matches ”This contains two files, hw3.py”
• And it even matches ”This contains two files, hw3.py and uppercase.py”
• Python will choose the longest match
• I could break my file into words first
• Or I could specify that no spaces are allowed in my match
A practical example
Does this string contain a legal Python filename?
import re
myrule = re.compile(r"[^ ]+\.py")
mystring = "This contains two files, hw3.py and uppercase.py."
mymatch = myrule.search(mystring)
print mymatch.group()
hw3.py
allmymatches = myrule.findall(mystring)
print allmymatches
['hw3.py', 'uppercase.py']
Practice problem 3
• Create a regexp which detects legal Microsoft Word file names
• The file name must end with ”.doc” or ”.DOC”
• There must be at least one character before the dot.
• We will assume there are no spaces in the names
• Print out a list of all the legal file names you find
• Test it on testre.txt (on the web site)
Practice problem 4
• Create a regexp which detects legal Microsoft Word file names that do
not contain any numerals (0 through 9)
• Print out the start location of the first such filename you encounter
• Test it on testre.txt
Practice problem
• Create a regexp which detects legal Microsoft Word file names that do
not contain any numerals (0 through 9)
• Print out the "base name", i.e., the file name after stripping off the .doc extension, of each such filename you encounter. Hint: use parenthesized sub-patterns.
• Test it on testre.txt
Practice problem 1 solution
Write a regexp that will match any string that starts with ”hum” and ends
with ”001” with any number of characters, including none, in between
myrule = re.compile(r"hum.*001")
Practice problem 2 solution
Write a regexp that will match any Python (.py) file.
myrule = re.compile(r".+\.py")
# if you want to find filenames embedded in a bigger
# string, better is:
myrule = re.compile(r"[^ ]+\.py")
# this version does not allow whitespace in file names
Practice problem 3 solution
Create a regexp which detects legal Microsoft Word file names, and use it
to make a list of them
import sys
import re
filename = sys.argv[1]
filehandle = open(filename,"r")
filecontents = filehandle.read()
myrule = re.compile(r"[^ ]+\.[dD][oO][cC]")
matchlist = myrule.findall(filecontents)
print matchlist
Practice problem 4 solution
Create a regexp which detects legal Microsoft Word file names which do
not contain any numerals, and print the location of the first such filename
you encounter
import sys
import re
filename = sys.argv[1]
filehandle = open(filename,"r")
filecontents = filehandle.read()
myrule = re.compile(r"[^ 0-9]+\.[dD][oO][cC]")
match = myrule.search(filecontents)
print match.start()
Regular expressions summary
• The re module lets us use regular expressions
• These are fast ways to search for complicated strings
• They are not essential to using Python, but are very useful
• File format conversion uses them a lot
• Compiling a regexp produces a Pattern object which can then be used
to search
• Searching produces a Match object which can then be asked for
information about the match
Natural Language
Processing
CSA4006
Dr. Anirban Bhowmick
Assistant Professor
VIT Bhopal
Lecture : 4
Module 1:
Introduction
Topic: Regular Expression
Regular Expression
Split the string at every white-space character:
txt = "The rain in Spain"
x = re.split(r"\s", txt)
print(x)
['The', 'rain', 'in', 'Spain']

Split the string at the first white-space character only:
txt = "The rain in Spain"
x = re.split(r"\s", txt, 1)
print(x)
['The', 'rain in Spain']

Replace all white-space characters with the digit "9":
txt = "The rain in Spain"
x = re.sub(r"\s", "9", txt)
print(x)
The9rain9in9Spain
Regular Expression
Write a Python program that removes all HTML tags from an HTML document. Create a function that takes an HTML string as input and returns the text content without any HTML tags. Use regular expressions to accomplish this, taking into account different tag attributes and formats.

import re

html_text = """
<!DOCTYPE html>
<html>
<head>
<title>Sample HTML Document</title>
</head>
<body>
<h1>Welcome to my website!</h1>
<p>This is a sample HTML document.</p>
</body>
</html>
"""

html_tag_pattern = r'<[^>]*>'
clean_text = re.sub(html_tag_pattern, '', html_text)
print(clean_text)
Regular Expression
Write a Python program that validates a list of email addresses. Create a function that takes a list of email addresses as input and returns a list of valid email addresses. Use regular expressions to validate each email address according to common email address patterns.

import re

def validate_email_addresses(email_list):
    # Regular expression pattern for a valid email address
    email_pattern = r'^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$'
    # List to store valid email addresses
    valid_emails = []
    for email in email_list:
        if re.match(email_pattern, email):
            valid_emails.append(email)
    return valid_emails

# Example usage:
email_list = [
    "[email protected]",
    "[email protected]",
    "invalid-email",
    "another@example",
]
valid_emails = validate_email_addresses(email_list)
print("Valid Email Addresses:")
for email in valid_emails:
    print(email)
Finite-state Automata
The regular expression is more than just a convenient
metalanguage for text searching. First, a regular
expression is one way of describing a finite-state
automaton (FSA). Finite-state automata are the
theoretical foundation of a good deal of the
computational work. Any regular expression can be
implemented as a finite-state automaton.
Symmetrically, any finite-state automaton can be
described with a regular expression. Second, a
regular expression is one way of characterizing a
particular kind of formal language called a regular
language. Both regular expressions and finite-state
automata can be used to describe regular languages.
A third equivalent method of characterizing the regular languages is the regular grammar.
Finite-state Automata
Finite automata are simple abstract machines used to recognize patterns. Finite automata are also known as finite-state machines. A finite automaton is a mathematical model of a system with discrete inputs, outputs, states, and a set of transitions from state to state that occur on input alphabet symbols. In simple words, it has a set of states and rules for moving from one state to the next, depending on the input symbol.
Q: Finite set of states represented by vertices.
Σ: set of Input Symbols.
𝑞0 : Initial state represented by empty incoming arc.
F: set of Final States represented by double circle.
δ: Transition Function represented by arcs.
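A direct Python rendering of this five-tuple (a sketch; the toy machine below, which accepts strings of a's and b's ending in "ab", is invented for illustration):

Q = {"q0", "q1", "q2"}          # finite set of states
sigma = {"a", "b"}              # input alphabet
q0 = "q0"                       # initial state
F = {"q2"}                      # set of final states
delta = {                       # transition function as a dictionary
    ("q0", "a"): "q1", ("q0", "b"): "q0",
    ("q1", "a"): "q1", ("q1", "b"): "q2",
    ("q2", "a"): "q1", ("q2", "b"): "q0",
}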
Natural Language
Processing
CSA4006
Dr. Anirban Bhowmick
Assistant Professor
VIT Bhopal
Lecture : 5
Module 1:
Introduction
Topic: Regular Expression
Determinism and Non-Determinism
Deterministic: A Deterministic Finite Automaton (DFA) is a
mathematical model and computational device used to recognize
and accept a set of strings over a finite alphabet. It is a type of
finite state machine characterized by its deterministic nature,
meaning that for each state and input symbol, there is exactly
one defined transition to another state.
Non-deterministic: There is a choice of several transitions that
can be taken given a current state and input symbol. (The
machine doesn’t specify how to make the choice.)
Potential solutions:
• Save backup states at each choice point
• Look-ahead in the input before making choice
• Pursue alternatives in parallel
• Determinize our NFSAs (and then minimize)
Using an FSA to Recognize Sheeptalk
Let's begin with the "sheep language". We can define the sheep language as any string from the following (infinite) set:
baa!
baaa!
baaaa!
baaaaa!
baaaaaa!
...
Directed graph with labeled nodes and arc transitions
Five states: q0 the start state, q4 the final state, 5
transitions
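Since any FSA has an equivalent regular expression, the sheep language is exactly the pattern baa+! (a quick Python check):

import re

for s in ["baa!", "baaaa!", "ba!", "baaa"]:
    print(s, bool(re.fullmatch(r"baa+!", s)))
# baa! True, baaaa! True, ba! False, baaa False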
Formally
State Transition Table for Sheeptalk:
State q0: on b go to q1
State q1: on a go to q2
State q2: on a go to q3
State q3: on a stay in q3; on ! go to q4
State q4: accepting state
Recognition and Rejection
The machine starts in the start state (q0), and iterates the following process: Check the next letter of
the input. If it matches the symbol on an arc leaving the current state, then cross that arc, move to the
next state, and also advance one symbol in the input. If we are in the accepting state (q4) when we
run out of input, the machine has successfully recognized an instance of sheeptalk. If the machine
never gets to the final state, either because it runs out of input, or it gets some input that doesn’t match
an arc, or if it just happens to get stuck in some non-final state, we say the machine rejects or fails to accept the input.
(Figure: the tape metaphor, showing a rejected input.)
D-Recognize
The algorithm is called D-RECOGNIZE for
“deterministic recognizer”. D-RECOGNIZE
begins by setting the variable index to the
beginning of the tape, and current-state to
the machine’s initial state. D-RECOGNIZE
then enters a loop that drives the rest of the
algorithm. It first checks whether it has
reached the end of its input. If so, it either
accepts the input (if the current state is an
accept state) or rejects the input
(if not).
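A Python sketch of D-RECOGNIZE for the sheeptalk automaton; the dictionary encoding of the transition table is this sketch's choice, not the book's pseudocode:

TRANSITIONS = {
    "q0": {"b": "q1"},
    "q1": {"a": "q2"},
    "q2": {"a": "q3"},
    "q3": {"a": "q3", "!": "q4"},
    "q4": {},
}
ACCEPT = {"q4"}

def d_recognize(tape):
    current_state = "q0"
    for symbol in tape:                              # advance along the tape
        if symbol not in TRANSITIONS[current_state]:
            return False                             # no legal transition: reject
        current_state = TRANSITIONS[current_state][symbol]
    return current_state in ACCEPT                   # end of input: accept state?

print(d_recognize("baaa!"))   # True
print(d_recognize("abc"))     # False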
D-Recognize
Before examining the beginning of the tape, the machine is in
state q0. Finding a b
on input tape, it changes to state q1 as indicated by the contents
of transition-table[q0,b]. It then finds an a and switches to state q2,
another a puts it in state q3, a third a leaves it in state q3, where it
reads the “!”, and switches to state q4. Since there is no more
input, the End of input condition at the beginning of the loop is
satisfied for the first time and the machine halts in q4. State q4 is
an accepting state, and so the machine has accepted the string
baaa! as a sentence in the sheep language. The algorithm will fail
whenever there is no legal transition for a given combination of
state and input. The input abc will fail to be recognized since there
is no legal transition out of state q0 on the input a. Even if the
automaton had allowed an initial a it would have certainly failed on
c, since c isn’t even in the sheeptalk alphabet! We can think of
these “empty” elements in the table as if they all pointed at one
“empty” state, which we might call the fail state or sink state.
Formal Language
A formal language is a set of strings, each string composed of symbols from a finite
symbol-set called an alphabet (the same alphabet used above for defining an
automaton!). The alphabet for the sheep language is the set ∑ = {a,b, !}. Given a model m
(such as a particular FSA), we can use L(m) to mean “the formal language characterized
by m". So the formal language defined by our sheeptalk automaton m is the infinite set:
L(m) = {baa!, baaa!, baaaa!, baaaaa!, ...}
The usefulness of an automaton for defining a language is that it can express an infinite
set (such as this one above) in a closed form. Formal languages are not the same as
natural languages, which are the kind of languages that real people speak. In fact, a
formal language may bear no resemblance at all to a real language (e.g., a formal
language can be used to model the different states of a soda machine). But we often use
a formal language to model part of a natural language, such as parts of the phonology,
morphology, or syntax. The term generative grammar is sometimes used in linguistics to
mean a grammar of a formal language; the origin of the term is this use of an automaton
to define a language by generating all possible strings.
Another Example
We can also have a higher-level alphabet consisting of words. In this way we can write finite-state automata that model facts about word combinations. For example, suppose we wanted to build an FSA that modeled the subpart of English dealing with amounts of money. Such a formal language would model the subset of English consisting of phrases like ten cents, three dollars, one dollar thirty-five cents, and so on.
Example
"Fifty one dollars twenty two cents" traces the path q0 → q1 → q2 → q4 → q5 → q6 → q7 through the money FSA.
Module 2:
Morphology And Finite-State Transducers:
Inflectional Morphology - Derivational Morphology - Finite-State Morphological Parsing - The Lexicon and Morphotactics - Morphological Parsing with Finite-State Transducers - Combining FST Lexicon and Rules - Lexicon-free FSTs: The Porter Stemmer - Human Morphological Processing - Speech Sounds and Phonetic Transcription - The Phoneme and Phonological Rules
Introduction
Morphological parsing is a linguistic process that involves breaking down words into their constituent
morphemes. Morphemes are the smallest units of meaning in a language and can be individual words or
meaningful parts of words, such as prefixes, suffixes, and roots. Morphological parsing is an essential
aspect of linguistic analysis, especially in languages with complex inflectional and derivational
morphology, like many Indo-European languages.
A morphological parsing system must be able to distinguish between orthographic rules and morphological rules.
Orthographic rules are general rules used when breaking a word into its stem and modifiers. An
example would be: singular English words ending with -y, when pluralized, end with -ies. Contrast this
to morphological rules which contain corner cases to these general rules. Both of these types of rules
are used to construct systems that can do morphological parsing
Morphological rules tell us the plural of goose is formed by changing the vowel.
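Both kinds of rules can be sketched together in a few lines of Python: the orthographic -y/-ies rule as a regular expression, and the morphological corner cases as a toy, purely illustrative exception lexicon:

import re

IRREGULAR = {"goose": "geese", "mouse": "mice"}   # illustrative lexicon

def pluralize(noun):
    if noun in IRREGULAR:                     # morphological rule: lexical lookup
        return IRREGULAR[noun]
    if re.search(r"[^aeiou]y$", noun):        # orthographic rule: consonant + y
        return re.sub(r"y$", "ies", noun)
    return noun + "s"                         # default regular rule

print(pluralize("butterfly"))  # butterflies
print(pluralize("goose"))      # geese
print(pluralize("cat"))        # cats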
Morphemes
Morphemes: Morphemes are the smallest units of meaning in a language.
For example, the word fox consists of a single morpheme (the morpheme fox), while the word cats consists of two: the morpheme cat and the morpheme -s.
Types of Morpheme:
One morpheme: nation
Two morphemes: national (nation, -al)
Three morphemes: nationalize (nation, -al, -ize)
Four morphemes: denationalize (de-, nation, -al, -ize)
Morphemes
Morphemes can be classified into two main categories:
Free Morphemes (stem): These are complete words that can stand alone and carry meaning on their
own (e.g., "book," "run").
Bound Morphemes (affixes): These are meaningful units that cannot stand alone and must be attached
to a free morpheme to convey meaning. Bound morphemes include prefixes (e.g., "un-" in "undo"),
suffixes (e.g., "-ed" in "walked"), and infixes (inserted inside a word, like in some Tagalog verb forms).
Bound morphemes give meaning when added to another morpheme: -s in walks, re- in replay, -er in cheaper, im- in impossible, en- in enlighten, un- in unable.
Free morphemes stand alone as words, e.g., girl, cat, dog, little, book, bag.
Prefixes: impossible, reply, unhappy, confirm, compress
Suffixes: passion, ambition, unity, walking
Circumfixes: enlighten, embolden
Concatenative Morphology & Non Concatenative
Morphology
Prefixes and suffixes are often called concatenative morphology since a word is composed of a
number of morphemes concatenated together
Circumfixes (Not in English)
Eg: In German, for example
The past participle of some verbs formed by adding ge to the beginning of the stem and t to the
end
so the past participle of the verb sagen (to say) is gesagt (said).
A number of languages have extensive non concatenative morphology, in which morphemes are
combined in more complex ways
Another kind of non concatenative morphology is called templatic morphology or root and pattern
morphology This is very common in Arabic, Hebrew, and other Semitic languages
Non Concatenative Morphology
In Hebrew, for example, a verb is constructed using two components: a root, usually consisting of three consonants and carrying the basic meaning, and a template, which gives the ordering of consonants and vowels and specifies more semantic information about the resulting verb, such as the semantic voice (e.g., active, passive, middle).
The Hebrew tri-consonantal root lmd, meaning 'learn' or 'study', can be combined with:
the active voice CaCaC template to produce the word lamad, 'he studied';
the intensive CiCeC template to produce the word limed, 'he taught';
the intensive passive CuCaC template to produce the word lumad, 'he was taught'.
Natural Language
Processing
CSA4006
Dr. Anirban Bhowmick
Assistant Professor
VIT Bhopal
Lecture : 6
Morphemes
Two broad classes of ways to form words from morphemes:
– Inflection: the combination of a word stem with a grammatical morpheme, usually resulting in a word of the same class as the original stem, and usually filling some syntactic function like agreement. For example, English has the inflectional morpheme -s for marking the plural on nouns, and the inflectional morpheme -ed for marking the past tense on verbs. The meaning of the resulting word is easily predictable.
– Derivation: the combination of a word stem with a grammatical morpheme, usually resulting in a word of a different class, often with a meaning hard to predict exactly. For example, the verb computerize can take the derivational suffix -ation to produce the noun computerization.
Inflection
In English, only nouns, verbs, and sometimes adjectives can be inflected, and the number of affixes is quite small.
English nouns have only two kinds of inflection: an affix that marks plural and an affix that marks possessive. For example, many (but not all) English nouns can either appear in the bare stem or singular form, or take a plural suffix. Examples of the regular plural suffix -s (also spelled -es) are cat/cats and thrush/thrushes; mouse/mice and ox/oxen are irregular plurals.
Inflection
The irregular verbs are those that have some more or less idiosyncratic forms of inflection.
Inflection
An irregular verb can inflect in the past form (also called the preterite) by changing its vowel (eat/ate), or its vowel and some consonants (catch/caught), or with no change at all (cut/cut).
The -s form is used in the habitual present to distinguish the 3rd-person singular ending (She jogs every Tuesday) from the other choices of person and number (I/you/we/they jog every Tuesday).
The stem form is used in the infinitive, and also after certain other verbs (I'd rather walk home, I want to walk home).
The -ing participle is used when the verb is treated as a noun, called a gerund use, e.g., Fishing is fine if you live near water.
The -ed participle is used in the perfect construction (He's eaten lunch already) or the passive construction (The verdict was overturned yesterday).
Inflection
A single consonant letter is doubled before adding the -ing and -ed suffixes (beg/begging/begged).
If the final letter is c, the doubling is spelled ck (picnic/picnicking/picnicked).
If the base ends in a silent e, it is deleted before adding -ing and -ed (merge/merging/merged).
Just as for nouns, the -s ending is spelled -es after verb stems ending in -s (toss/tosses), -z (waltz/waltzes), -sh (wash/washes), -ch (catch/catches), and sometimes -x (tax/taxes).
Also like nouns, verbs ending in y preceded by a consonant change the y to i (try/tries).
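These spelling rules are concrete enough to sketch in code. Below is a rough Python illustration (my own sketch, not from the textbook): add_es and add_suffix are invented helper names, and the consonant-doubling test is simplified, since real doubling also depends on stress.

def add_es(stem):
    # -es after sibilants: toss/tosses, waltz/waltzes, wash/washes,
    # catch/catches, tax/taxes
    if stem.endswith(("s", "z", "sh", "ch", "x")):
        return stem + "es"
    # y preceded by a consonant changes to i: try/tries
    if stem.endswith("y") and len(stem) > 1 and stem[-2] not in "aeiou":
        return stem[:-1] + "ies"
    return stem + "s"

def add_suffix(stem, suffix):
    # suffix is "ing" or "ed"
    if stem.endswith("e") and not stem.endswith("ee"):
        return stem[:-1] + suffix          # silent-e deletion: merge -> merging
    if stem.endswith("c"):
        return stem + "k" + suffix         # picnic -> picnicking
    if (len(stem) >= 3 and stem[-1] not in "aeiouwxy"
            and stem[-2] in "aeiou" and stem[-3] not in "aeiou"):
        return stem + stem[-1] + suffix    # doubling: beg -> begging (simplified)
    return stem + suffix

for stem in ["toss", "waltz", "try", "walk"]:
    print(add_es(stem))
for stem in ["beg", "picnic", "merge"]:
    print(add_suffix(stem, "ing"), add_suffix(stem, "ed"))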
Derivational Morphology
Derivation in English is quite complex. It is the combination of a word stem with a grammatical morpheme, usually resulting in a word of a different class, often with a meaning hard to predict exactly.
A very common kind of derivation in English is the formation of new nouns, often from verbs or adjectives. This process is called nominalization.
For example, the suffix -ation produces nouns from verbs, often verbs ending in the suffix -ize (computerize/computerization).
Adjectives can also be derived from nouns and verbs.
Derivational Morphology
Derivation in English is more complex than inflection because:
– It is generally less productive: a nominalizing affix like -ation cannot be added to absolutely every verb (*eatation).
– There are subtle and complex meaning differences among nominalizing suffixes. For example, sincerity has a subtle difference in meaning from sincereness.
Morphological parsing
Breaking down words into components and building a structured representation.
– English:
● cats → cat +N +Pl
● caught → catch +V +Past
– Spanish:
● vino (came) → venir +V +Perf +3P +Sg
● vino (wine) → vino +N +Masc +Sg
Importance:
● Information retrieval: normalize verb tenses, plurals, grammar cases.
● Machine translation: translation based on the stem.
Finite-State Morphological Parsing
Parsing English morphology
Finite-State Morphological Parsing
We need at least the following to build a morphological parser:
1. Lexicon: the list of stems and affixes, together with basic information about them (Noun stem or Verb stem, etc.).
2. Morphotactics: the model of morpheme ordering that explains which classes of morphemes can follow other classes of morphemes inside a word, e.g., the rule that the English plural morpheme follows the noun rather than preceding it.
3. Orthographic rules: these spelling rules are used to model the changes that occur in a word, usually when two morphemes combine (e.g., the y→ie spelling rule changes city + -s to cities).
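To make the lexicon-plus-morphotactics idea concrete, here is a minimal Python sketch (my own illustration, not from the textbook); the tiny lexicon is invented, and orthographic rules are deliberately left out:

reg_nouns = {"fox", "cat", "dog"}
irreg_sg_nouns = {"goose", "mouse"}
irreg_pl_nouns = {"geese", "mice"}

def accepts_noun(word):
    # Accept a bare regular stem, an irregular singular or plural form,
    # or a regular stem followed by the plural -s morpheme.
    if word in reg_nouns or word in irreg_sg_nouns or word in irreg_pl_nouns:
        return True
    if word.endswith("s") and word[:-1] in reg_nouns:
        return True
    return False

for w in ["cats", "goose", "geese", "gooses", "foxs"]:
    print(w, accepts_noun(w))
# 'gooses' is correctly rejected; 'foxs' is accepted because this
# sketch has no orthographic rules yet. The e-insertion rule that
# produces fox -> foxes is exactly what item 3 above supplies.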
The Lexicon and Morphotactics
A lexicon is a repository for words.
– The simplest one would consist of an explicit list of every word of the language. Inconvenient or impossible!
– Computational lexicons are usually structured with
• a list of each of the stems and
• affixes of the language, together with a representation of the morphotactics telling us how they can fit together.
– The most common way of modeling morphotactics is the finite-state automaton.
Lecture : 7
Module 2:
Morphology And Finite-State Transducers:
Inflectional Morphology -Derivational
Morphology- Finite-State Morphological Parsing-
The Lexicon and Morphotactics - Morphological
Parsing with Finite-State Transducers-
Combining FST Lexicon and Rules- Lexicon-free
FSTs: The Porter Stemmer- Human
Morphological Processing- Speech Sounds
and Phonetic Transcription- The Phoneme and
Phonological Rules
Module 2:
Morphology
Topic: FSA and FST
The Lexicon and Morphotactics
English derivational morphology is more complex than English inflectional morphology, so automata for modeling English derivation tend to be quite complex.
– Some are even based on CFGs.
• Example: a small part of the morphosyntax of English adjectives.
In a Finite State Automaton (FSA), epsilon (ε) transitions are used to represent transitions between states without consuming any input symbol. These transitions are also known as "null" or "empty" transitions. When two states have an epsilon transition between them, you can move from one state to the other without reading any input symbol.
The Lexicon and Morphotactics
FSA #1 recognizes all the listed adjectives, but also ungrammatical forms like unbig, redly, and realest.
• Thus #1 is revised to become #2.
• This complexity is expected of English derivation.
The Lexicon and Morphotactics
We can now use these FSAs to solve the problem of morphological recognition:
– determining whether an input string of letters makes up a legitimate English word or not.
– We do this by taking the morphotactic FSAs and plugging each "sub-lexicon" into the FSA.
– The resulting FSA can then be defined at the level of the individual letter.
Finite-State Transducers (FST)
An FST is a type of FSA which maps between two sets of symbols.
● It is a two-tape automaton that recognizes or generates pairs of strings, one from each tape.
● An FST defines relations between sets of strings.
Given the input, for example, cats, we would like to produce cat +N +PL.
• Two-level morphology, by Koskenniemi (1983):
– represents a word as a correspondence between a lexical level, representing a simple concatenation of the morphemes making up the word, and
– the surface level, representing the actual spelling of the final word.
• Morphological parsing is implemented by building mapping rules that map letter sequences like cats on the surface level into morpheme and feature sequences like cat +N +PL on the lexical level.
Finite-State Transducers (FST)
The automaton we use for performing the mapping between these two levels is the finite-state transducer, or FST.
– A transducer maps between one set of symbols and another;
– an FST does this via a finite automaton.
• Thus an FST can be seen as a two-tape automaton which recognizes or generates pairs of strings.
• The FST has a more general function than an FSA:
– an FSA defines a formal language;
– an FST defines a relation between sets of strings.
• Another view of an FST: a machine that reads one string and generates another.
FST
FST as recognizer:
– a transducer that takes a pair of strings as input and outputs accept if the string pair is in the string-pair language, and reject if it is not.
FST as generator:
– a machine that outputs pairs of strings of the language; thus the output is a yes or no, and a pair of output strings.
FST as translator:
– a machine that reads a string and outputs another string.
FST as set relater:
– a machine that computes relations between sets.
FST
A formal definition of an FST (based on the Mealy machine extension to a simple FSA):
– Q: a finite set of N states q0, q1, ..., qN.
– Σ: a finite alphabet of complex symbols. Each complex symbol is composed of an input-output pair i:o, with one symbol i from an input alphabet I and one symbol o from an output alphabet O; thus Σ ⊆ I × O. I and O may each also include the epsilon symbol ε.
– q0: the start state.
– F: the set of final states, F ⊆ Q.
– δ(q, i:o): the transition function or transition matrix between states. Given a state q ∈ Q and a complex symbol i:o ∈ Σ, δ(q, i:o) returns a new state q′ ∈ Q. δ is thus a relation from Q × Σ to Q.
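The definition translates almost directly into code. A minimal sketch in Python (my own illustration; the toy transducer, its state numbers, and the accepts helper are invented for the example):

# Minimal FST used as a recognizer of string pairs (illustrative toy).
# delta maps (state, input symbol, output symbol) -> next state.
# EPS ('') lets one tape advance while the other stands still.
EPS = ""

# Toy transducer relating surface "cats" to lexical "cat+N+PL".
delta = {
    (0, "c", "c"): 1,
    (1, "a", "a"): 2,
    (2, "t", "t"): 3,
    (3, EPS, "+N"): 4,     # emit +N without consuming input
    (4, "s", "+PL"): 5,    # read plural -s, emit +PL
}
finals = {5}

def accepts(pair, state=0, i=0, j=0):
    """True if the (surface, lexical) pair is in the transducer's relation."""
    surface, lexical = pair
    if i == len(surface) and j == len(lexical) and state in finals:
        return True
    for (q, a, b), q2 in delta.items():
        if q != state:
            continue
        if a != EPS and not surface.startswith(a, i):
            continue
        if b != EPS and not lexical[j:].startswith(b):
            continue
        if accepts(pair, q2, i + len(a), j + len(b)):
            return True
    return False

print(accepts(("cats", "cat+N+PL")))  # True
print(accepts(("cats", "cat+V")))     # False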
FST
• FSAs are isomorphic to regular languages; FSTs are isomorphic to regular relations.
• Regular relations are sets of pairs of strings, a natural extension of regular languages, which are sets of strings.
• FSTs are closed under union, but generally they are not closed under difference, complementation, and intersection.
• Two useful closure properties of FSTs:
– Inversion: if T maps from I to O, then the inverse of T, T^(-1), maps from O to I.
– Composition: if T1 is a transducer from I1 to O1 and T2 a transducer from O1 to O2, then T1 · T2 maps from I1 to O2.
• Inversion is useful because it makes it easy to convert an FST-as-parser into an FST-as-generator.
• Composition is useful because it allows us to take two transducers that run in series and replace them with one more complex transducer:
– (T1 · T2)(S) = T2(T1(S))
FST
The composition of [a:b]+ with [b:c]+ to produce [a:c]+.
Lecture : 8
Module 2: Morphology
Topic: FSA and FST
FST
A transducer T_num for English nominal number inflection
FST
The transducer T_stems, which maps roots to their root class
FST
A fleshed-out English nominal inflection FST: T_lex = T_num · T_stems
Orthographic Rules and FSTs
These spelling changes can be thought of as taking as input a simple concatenation of morphemes and producing as output a slightly modified concatenation of morphemes.
Orthographic Rules and FSTs
We note that concatenating the morphemes works to parse words like "dog", "cat", or "fox", but this simple method does not work when there is a spelling change, as when "foxes" is to be parsed into the lexical form "fox +N +PL" or "cats" into "cat +N +PL". This requires the introduction of spelling rules (also called orthographic rules). To account for the spelling rules, we introduce another tape, called the intermediate tape, which carries a slightly modified concatenation of morphemes, thus going from two-level to three-level morphology. Such a rule maps from the intermediate tape to the surface tape. For plural nouns, the e-insertion rule states: "insert e on the surface tape just when the lexical tape has a morpheme ending in x, s, or z and the next morpheme is -s". Examples are box to boxes and fox to foxes. The rule is stated as
ε → e / {x, s, z}^ __ s#
This is Chomsky and Halle notation: a rule of the form a → b / c __ d means "rewrite a as b when it occurs between c and d". Since the symbol ε is null, replacing it means inserting something. The symbol ^ indicates a morpheme boundary and # a word boundary; morpheme boundaries are deleted by including the symbol ^:ε in the default pairs for the transducer.
Orthographic Rules and FSTs
● Lexical: fox +N +Pl
● Intermediate: fox^s#
● Surface: foxes
The transducer for the e-insertion rule
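As a rough approximation, the e-insertion rule can be sketched as a string rewrite on the intermediate tape (my own illustration in Python, not the actual transducer; the function name is invented):

import re

def e_insertion(intermediate):
    # Insert e between a morpheme ending in x, s, or z and a following
    # -s morpheme, then delete the ^ and # boundary symbols.
    surface = re.sub(r"([xsz])\^s#", r"\1es#", intermediate)  # fox^s# -> foxes#
    return surface.replace("^", "").replace("#", "")          # drop boundaries

for form in ["fox^s#", "cat^s#", "waltz^s#"]:
    print(form, "->", e_insertion(form))
# fox^s# -> foxes, cat^s# -> cats, waltz^s# -> waltzes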
Combining FST Lexicon and Rules
By running these multi-level FSTs in sequence between the different tapes, together with parallel transducers for the spelling rules, we are able to parse words whose morphological analysis is simple.
However, consider the sentence "The police books the right culprit". From the rules above it is not clear whether the lexical parser's output for "books" should be "book +N +PL" or "book +V +3SG". For a human, however, it is not difficult to infer that it is the second. The difficulty comes from the ambiguity of the word, which may be a noun or a verb depending on its position in the sentence. This type of ambiguity is called lexical (part-of-speech) ambiguity, and resolving it is a disambiguation task.
Lexicon-Free FSTs: The Porter Stemmer
• Widely used in information retrieval.
• One of the most widely used stemming algorithms is the simple and efficient Porter (1980) algorithm, which is based on a series of simple cascaded rewrite rules, e.g.:
– ATIONAL → ATE (e.g., relational → relate)
– ING → ε if the stem contains a vowel (e.g., motoring → motor)
• Problem:
– Not perfect: errors of commission and of omission.
• Experiments have been made:
– some improvement with smaller documents;
– any improvement is quite small.
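NLTK ships an implementation of the Porter algorithm, so the cascaded rules can be tried directly (usage sketch; exact outputs depend on the implementation version, so they are hedged in the comments):

from nltk.stem import PorterStemmer  # pip install nltk

stemmer = PorterStemmer()
for word in ["relational", "motoring", "caresses", "organization", "university"]:
    print(word, "->", stemmer.stem(word))
# 'motoring' -> 'motor' via ING -> eps; 'relational' goes through
# ATIONAL -> ATE and further steps. Errors of commission/omission
# remain: e.g., related words do not always map to the same stem.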
Human Morphological Processing
Psychological studies ask how multi-morphemic words are represented in the minds of speakers of English.
For example, consider the word walk and its inflected forms walks and walked. Are all three in the human lexicon? Or merely walk, along with -ed and -s? How about the word happy and its derived forms happily and happiness?
The full listing hypothesis proposes that all words of a language are listed in the mental lexicon without any internal morphological structure.
• Morphological structure is simply an epiphenomenon: walk, walks, walked, happy, and happily are all listed separately in the lexicon.
The minimum redundancy hypothesis suggests that only the constituent morphemes are represented in the lexicon, and that when processing walks (whether for reading, listening, or talking) we must access both morphemes (walk and -s) and combine them.
Human Morphological Processing
Some of the earliest evidence that the human lexicon represents at least some morphological structure comes from speech errors, e.g., "easy enoughly" (for "easily enough").
More recent experimental evidence suggests that neither the full listing nor the minimum redundancy hypothesis may be completely true. Instead, it is possible that some, but not all, morphological relationships are mentally represented.
For example, one line of experiments found that derived forms (e.g., happily) are stored separately from their stem (happy), but that regularly inflected forms are not distinct in the lexicon from their stems.
Marslen-Wilson et al. (1994) found that spoken derived words can prime their stems, but only if the meaning of the derived form is closely related to the stem.
• For example, government primes govern, but department does not prime depart.
SPEECH SOUNDS AND PHONETIC TRANSCRIPTION
• The fundamental insights and algorithms necessary to understand modern speech recognition and speech synthesis technology, and the related branch of linguistics called computational phonology.
• Core tasks:
– speech recognition: input an acoustic waveform, output a string of words;
– text-to-speech synthesis: input a sequence of text words, output an acoustic waveform.
A speech recognition system needs to have a pronunciation for every word it can recognize, and a text-to-speech system needs to have a pronunciation for every word it can say.
Contd.
• The science of phonetics aims to describe all the sounds of all the world's languages:
– acoustic phonetics focuses on the physical properties of the sounds of language;
– auditory phonetics focuses on how listeners perceive the sounds of language;
– articulatory phonetics focuses on how the vocal tract produces the sounds of language.
Phonetic alphabets: the pronunciation part of the field of phonetics.
Phonological rules: the systematic ways in which sounds are realized differently in different environments.
Computational phonology: the study of computational mechanisms for modeling phonological rules.
Phonological learning: how phonological rules can be automatically induced by machine learning algorithms.
IPA and ARPABET: Vowels
The International Phonetic Alphabet (IPA) and the ARPABET are two systems used to represent the sounds of spoken language. They provide a standardized way to transcribe the sounds of speech, which is useful for linguists, phoneticians, and language learners. The slide shows IPA and ARPABET transcriptions for the English vowels.
IPA and ARPABET: Consonants
The Vocal Organs
Articulatory phonetics is the study of how phones are produced, as the various organs in the mouth, throat, and nose modify the airflow from the lungs.
Sound is produced by the rapid movement of air. Most sounds in human languages are produced by expelling air from the lungs through the windpipe (technically, the trachea) and then out the mouth or nose.
As it passes through the trachea, the air passes through the larynx, commonly known as the Adam's apple or voice box. The larynx contains two small folds of muscle, the vocal folds (often referred to non-technically as the vocal cords), which can be moved together or apart. The space between these two folds is called the glottis.
Vocal Organs
Most speech sounds are produced by pushing air through the vocal cords.
– Glottis = the opening between the vocal cords
– Larynx = 'voice box'
– Pharynx = tubular part of the throat above the larynx
– Oral cavity = mouth
– Nasal cavity = nose and the passages connecting it to the throat and sinuses
Phones are divided into two main classes:
– Consonants are made by restricting or blocking the airflow in some way, and may be voiced or unvoiced.
– Vowels have less obstruction, are usually voiced, and are generally louder and longer-lasting than consonants.
• Both kinds of sounds are formed by the motion of air through the mouth, throat, or nose.
Lecture : 9
Module 2: Morphology
Consonants: Place of Articulation
Consonants are sounds produced with some restriction or closure in the vocal tract.
• Consonants are classified based in part on where in the vocal tract the airflow is being restricted (the place of articulation).
• The major places of articulation are bilabial, labiodental, interdental, alveolar, palatal, velar, uvular, and glottal.
Consonants: Place of Articulation
1.Bilabial: The airflow is obstructed by bringing both lips together.
Example: /p/ in "pat," /b/ in "bat," /m/ in "mat."
2.Labiodental: The airflow is obstructed by placing the upper teeth against the lower lip.
Example: /f/ in "fan," /v/ in "van."
3.Interdental: The airflow is obstructed by placing the tip of the tongue between the teeth.
Example: /θ/ in "think," /ð/ in "this."
4.Alveolar: The airflow is obstructed by raising the front part of the tongue to the alveolar ridge, which is the bony
ridge just behind the upper front teeth.
Example: /t/ in "top," /d/ in "dog," /s/ in "sock."
5.Alveopalatal (or Palatoalveolar): The airflow is obstructed by raising the front part of the tongue to the area just
behind the alveolar ridge.
Example: /ʃ/ in "shoe," /ʒ/ in "measure," /tʃ/ in "cheese," /dʒ/ in "judge."
6.Palatal: The airflow is obstructed by raising the middle part of the tongue to the hard palate, which is the roof of the
mouth right behind the alveolar ridge.
Example: /j/ in "yes," /ʎ/ in some dialects of Spanish.
7.Velar: The airflow is obstructed by raising the back part of the tongue to the soft part of the palate (the velum).
Example: /k/ in "cat," /g/ in "go," /ŋ/ in "sing."
8.Glottal: The airflow is obstructed by closing or nearly closing the space between the vocal cords in the larynx.
Example: /h/ in "hat," the glottal stop /ʔ/ in some dialects, as in "uh-oh."
Consonants: Manner of Articulation
Consonants can also be classified by their manner of articulation, which describes how the airflow is
obstructed or modified as they are produced. Here are some common manners of articulation for
consonants with examples:
Plosive (or Stop): These consonants are produced by a complete closure of the vocal tract, causing a
momentary halt in the airflow before releasing it.
Example: /p/ in "pat," /b/ in "bat," /t/ in "top," /d/ in "dog," /k/ in "cat," /g/ in "go.“
Fricative: Fricatives are produced by narrowing the vocal tract, creating turbulent airflow and a continuous,
hissing sound.
Example: /f/ in "fan," /v/ in "van," /s/ in "sock," /z/ in "zebra," /ʃ/ in "shoe," /ʒ/ in "measure."
Affricate: Affricates begin with a stop-like closure and then transition into a fricative sound.
Example: /tʃ/ in "cheese," /dʒ/ in "judge."
Contd.
Nasal: Nasal consonants are produced by lowering the velum (soft part of the roof of the mouth),
allowing air to flow through the nasal cavity.
Example: /m/ in "mat," /n/ in "net," /ŋ/ in "sing."
Liquid: Liquids involve a relatively free airflow, with slight constriction in the vocal tract.
Lateral Liquid: /l/ in "let."
Retroflex Liquid: /ɹ/ in "red" (Note: The pronunciation of this sound can vary regionally.)
Glide (Semivowel): Glides are produced with a slight constriction in the vocal tract but are more
vowel-like in nature.
Example: /j/ in "yes," /w/ in "we."
Approximant: Approximants have a less constricted airflow than fricatives but more than glides.
Example: /ɹ/ in "red" (in some dialects), /ʋ/ in some languages.
These are the main manners of articulation for consonants.
Vowels
Vowels are classified by how high or low the tongue is, whether the tongue is in the front or back of the mouth, and whether or not the lips are rounded.
High vowels: [i] [ɪ] [u] [ʊ]
Mid vowels: [e] [ɛ] [o] [ə] [ʌ] [ɔ]
Low vowels: [æ] [a]
Front vowels: [i] [ɪ] [e] [ɛ] [æ]
Central vowels: [ə] [ʌ]
Back vowels: [u] [ʊ] [ɔ] [o] [a]
Lecture : 10
Module 3:
Syntax Parsing: Tagsets for English - Part of
Speech Tagging- Rule based Part-of-speech
Tagging- Stochastic Part-of speech Tagging-
Transformation-Based Tagging- Context-Free
Grammars for English - Context-Free Rules and
Trees- The Noun Phrase. The Verb Phrase and
Subcategorization- Grammar Equivalence
&Normal Form- Finite State & Context-Free
Grammars.
Module 3: Syntax
Parsing
Topic: Introduction
Tagsets for English
There is a small number of popular tagsets for English. The choice of tagset depends on the nature of the application:
– small tagsets (more general);
– large tagsets (finer tags).
Some of the widely used part-of-speech tagsets are:
– the 45-tag Penn Treebank tagset;
– the 87-tag tagset used for the Brown corpus;
– the medium-sized 61-tag C5 tagset;
– the 146-tag C7 tagset.
Each word is assigned a tag from the chosen tagset.
Some Common Tagsets (English)
Penn Treebank Tagset:
This is one of the most widely used tagsets in natural language processing and linguistics. It
was developed for the Penn Treebank project and includes tags like NN (Noun), VB (Verb),
JJ (Adjective), RB (Adverb), and more. It's known for its detailed granularity.
Universal POS Tagset:
The Universal POS Tagset is designed to be more cross-linguistic and universal, making it
easier to work with multilingual data. It includes tags like NOUN, VERB, ADJ, ADV, and
others, providing a simpler and more consistent set of labels compared to the Penn Treebank
Tagset.
Brown Corpus Tagset:
The Brown Corpus is a well-known linguistic corpus, and it has its own tagset. It includes tags
like N (Noun), V (Verb), ADJ (Adjective), and others. It's used primarily for linguistic research.
CLAWS Tagset:
The CLAWS (Constituent Likelihood Automatic Word-tagging System) Tagset is designed to
be a detailed and linguistically motivated tagset. It includes a wide range of tags to capture
grammatical and syntactic information.
Some Common Tagsets (English)
Lancaster-Oslo/Bergen (LOB) Tagset:
The LOB Corpus is another linguistic corpus, and it has its own tagset. It's used primarily in
corpus linguistics and includes tags like NN (Noun), VB (Verb), JJ (Adjective), and more.
Medical Subject Headings (MeSH) Tagset:
This tagset is specific to the medical domain and is used for indexing and categorizing
medical texts. It includes tags like A1.4.1 (Anatomy), D1.1.1 (Diseases), and others.
OntoNotes Tagset:
The OntoNotes project developed a tagset for annotating a wide range of linguistic
information, including part of speech, named entities, and syntactic structures. It's used in
various natural language processing tasks.
Google Universal Dependencies Tagset:
Google's Universal Dependencies project aims to provide universal grammatical relations
and dependency labels for multiple languages, including English. It includes tags like NOUN,
VERB, ADJ, and more, similar to the Universal POS Tagset.
POS tagging
Part-of-speech tagging (or just tagging for short) is the process of assigning a part of speech or other lexical class marker to each word in a corpus.
Tags are also usually applied to punctuation markers; thus tagging for natural language is the same process as tokenization for computer languages, although tags for natural languages are much more ambiguous.
Even in simple examples, automatically assigning a tag to each word is not trivial. For example, book is ambiguous: it has more than one possible usage and part of speech. It can be a verb (as in book that flight or to book the suspect) or a noun (as in hand me that book, or a book of matches). Similarly, that can be a determiner (as in Does that flight serve dinner) or a complementizer (as in I thought that your flight was earlier).
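As a quick illustration, an off-the-shelf tagger such as NLTK's can be run on exactly these examples (usage sketch; the downloads are NLTK's standard data packages, though newer NLTK versions may name the tagger resource slightly differently, and the tags actually returned can vary with the model):

import nltk
nltk.download("punkt")                       # tokenizer data
nltk.download("averaged_perceptron_tagger")  # tagger model

for sent in ["Book that flight .", "Hand me that book ."]:
    print(nltk.pos_tag(nltk.word_tokenize(sent)))
# Ideally 'book' comes out VB (verb) in the first sentence and
# NN (noun) in the second; sentence-initial capitalization can
# trip the tagger, which is part of why tagging is nontrivial.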
Multiple POS
Words often have more than one POS. Consider back:
The back door → JJ
On my back → NN
Olga's not looking forward to going back to school in September. → RB
Promised to back the bill → VB
The POS tagging problem is to determine the POS tag for a particular instance of a word: to resolve these ambiguities, choosing the proper tag for the context. Part-of-speech tagging is thus one of the many disambiguation tasks.
How Hard is POS Tagging? Measuring Ambiguity
Methods for POS Tagging
Rule-based tagging uses hand-written rules:
– ENGTWOL (ENGlish TWO Level analysis).
Stochastic tagging uses probabilistic sequence models:
– HMM (Hidden Markov Model) tagging;
– MEMMs (Maximum Entropy Markov Models).
Transformation-based tagging uses rules learned automatically.
Rule-based POS tagging
The first stage uses a dictionary to assign each word a list of potential parts of speech.
The second stage uses large lists of hand-written disambiguation rules to winnow down this list to a single part of speech for each word.
These taggers are knowledge-driven:
– the rules are built manually;
– the information is coded in the form of rules;
– the number of rules is limited, approximately around 1,000;
– smoothing and language modeling are defined explicitly in rule-based taggers.
Rule-based POS tagging
Rule-based POS taggers can be relatively simple to implement and are often used as a starting point for more complex machine-learning-based taggers. However, they can be less accurate and less efficient than machine-learning-based taggers, especially for tasks with large or complex datasets.
Here is an example of how a rule-based POS tagger might work. Define a set of rules for assigning POS tags to words, for example:
– If the word ends in "-tion," assign the tag "noun."
– If the word ends in "-ment," assign the tag "noun."
– If the word is all uppercase, assign the tag "proper noun."
– If the word is a verb ending in "-ing," assign the tag "verb."
Then iterate through the words in the text and apply the rules to each word in turn (see the sketch below). For example:
– "Nation" would be tagged as "noun" based on the first rule.
– "Investment" would be tagged as "noun" based on the second rule.
– "UNITED" would be tagged as "proper noun" based on the third rule.
– "Running" would be tagged as "verb" based on the fourth rule.
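A minimal sketch of these four rules in Python (illustrative only; the rule order and the fallback tag are choices I have made, not part of the slide):

def rule_based_tag(word):
    # Apply the four example rules in order; fall back to 'unknown'.
    if word.lower().endswith("tion"):
        return "noun"
    if word.lower().endswith("ment"):
        return "noun"
    if word.isupper():
        return "proper noun"
    if word.lower().endswith("ing"):
        return "verb"
    return "unknown"

for w in ["Nation", "Investment", "UNITED", "Running"]:
    print(w, "->", rule_based_tag(w))
# Nation -> noun, Investment -> noun, UNITED -> proper noun, Running -> verb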
ENGTWOL: a rule-based tagger
– Uses a two-level lexicon transducer.
– Uses hand-crafted rules (about 1,100 rules).
Process: start with a dictionary.
ENGTWOL: a rule-based tagger
Example rule: eliminate VBN if VBD is an option when VBN|VBD follows "<start> PRP".
Stochastic Tagging: Probabilistic Sequence Models
HMM (Hidden Markov Model) tagging is a stochastic technique for POS tagging. Hidden Markov models are known for their applications to reinforcement learning and temporal pattern recognition such as speech, handwriting, and gesture recognition, musical score following, partial discharges, and bioinformatics.
Let us consider an example proposed by Dr. Luis Serrano and find out how an HMM selects an appropriate tag sequence for a sentence.
Training data:
Mary Jane can see Will
Spot will see Mary
Will Jane spot Mary?
Mary will pat Spot
HMM-POS Tagging
Words   Noun   Modal   Verb
Mary    4      0       0
Jane    2      0       0
Will    1      3       0
Spot    2      0       1
Can     0      1       0
See     0      0       2
Pat     0      0       1
HMM-POS Tagging
Now let us divide each column by the total number of appearances of that tag. For example, 'noun' appears nine times in the above sentences, so we divide each entry in the noun column by 9. We get the following table after this operation:

Words   Noun   Modal   Verb
Mary    4/9    0       0
Jane    2/9    0       0
Will    1/9    3/4     0
Spot    2/9    0       1/4
Can     0      1/4     0
See     0      0       2/4
Pat     0      0       1/4

These are the emission probabilities.
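These counts and ratios are easy to reproduce. A small sketch (my own code) that derives the emission probabilities from the hand-tagged toy corpus, with words lowercased:

from collections import Counter, defaultdict

# The toy corpus, hand-tagged: N = noun, M = modal, V = verb.
tagged = [
    [("mary","N"), ("jane","N"), ("can","M"), ("see","V"), ("will","N")],
    [("spot","N"), ("will","M"), ("see","V"), ("mary","N")],
    [("will","M"), ("jane","N"), ("spot","V"), ("mary","N")],
    [("mary","N"), ("will","M"), ("pat","V"), ("spot","N")],
]

tag_counts = Counter(t for sent in tagged for _, t in sent)
emit = defaultdict(Counter)
for sent in tagged:
    for word, tag in sent:
        emit[tag][word] += 1

# P(word | tag) = count(word, tag) / count(tag)
for tag in emit:
    for word, c in emit[tag].items():
        print(f"P({word} | {tag}) = {c}/{tag_counts[tag]}")
# e.g. P(mary | N) = 4/9, P(will | M) = 3/4, P(see | V) = 2/4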
HMM-POS Tagging
Next, we have to calculate the transition probabilities, so define two more tags, <S> and <E>. <S> is placed at the beginning of each sentence and <E> at the end, as shown in the figure below.

        N   M   V   <E>
<S>     3   1   0   0
N       1   3   1   4
M       1   0   3   0
V       4   0   0   0

In the figure, we can see that the <S> tag is followed by the N tag three times, so the first entry is 3. The modal tag follows <S> just once, so the second entry is 1. The rest of the table is filled in the same manner.
Next, we divide each entry in a row by the total number of occurrences of the tag in question. For example, the modal tag is followed by some other tag four times, so we divide each element in the modal row by four.
HMM-POS Tagging
        N     M     V     <E>
<S>     3/4   1/4   0     0
N       1/9   3/9   1/9   4/9
M       1/4   0     3/4   0
V       4/4   0     0     0
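The transition probabilities can be derived from the same toy corpus. This sketch continues the previous snippet, reusing its tagged list (my own code):

from collections import Counter, defaultdict  # as in the previous sketch

# Transition counts, with <S> and <E> padding each sentence.
trans = defaultdict(Counter)
for sent in tagged:
    tags = ["<S>"] + [t for _, t in sent] + ["<E>"]
    for prev, nxt in zip(tags, tags[1:]):
        trans[prev][nxt] += 1

# P(next_tag | prev_tag) = count(prev, next) / count(prev, anything)
for prev in trans:
    total = sum(trans[prev].values())
    for nxt, c in trans[prev].items():
        print(f"P({nxt} | {prev}) = {c}/{total}")
# e.g. P(N | <S>) = 3/4, P(V | M) = 3/4, P(N | V) = 4/4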
HMM-POS Tagging
Take a new sentence and tag it with (deliberately) wrong tags. Let the sentence 'Will can spot Mary' be tagged as:
Will as a modal
Can as a verb
Spot as a noun
Mary as a noun
Now calculate the probability of this sequence being correct in the following manner.
The probability that the tag Modal (M) comes after the tag <S> is 1/4, as seen in the table. Also, the probability that the word Will is a modal is 3/4. In the same manner, we calculate every probability in the graph. The product of these probabilities is the likelihood that this sequence is right. Since the tags are not correct, the product is zero:
1/4 * 3/4 * 3/4 * 0 * 1 * 2/9 * 1/9 * 4/9 * 4/9 = 0
HMM-POS Tagging
When the words are correctly tagged, we get a probability greater than zero, as shown below. Calculating the product of these terms, we get:
3/4 * 1/9 * 3/9 * 1/4 * 3/4 * 1/4 * 1 * 4/9 * 4/9 = 0.00025720164
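Both products are easy to verify with exact arithmetic (the factor lists below are transcribed from the two calculations above):

from fractions import Fraction as F
from math import prod

wrong   = [F(1,4), F(3,4), F(3,4), F(0), F(1), F(2,9), F(1,9), F(4,9), F(4,9)]
correct = [F(3,4), F(1,9), F(3,9), F(1,4), F(3,4), F(1,4), F(1), F(4,9), F(4,9)]

print(float(prod(wrong)))    # 0.0
print(float(prod(correct)))  # ~0.00025720164 (exactly 1/3888)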
HMM-POS Tagging
For our example, considering just the three POS tags we have mentioned, 81 different combinations of tags can be formed for a four-word sentence. In this case, calculating the probabilities of all 81 combinations seems achievable. But when the task is to tag a longer sentence and all the POS tags of the Penn Treebank project are taken into consideration, the number of possible combinations grows exponentially, and this task seems impossible to achieve. Now let us visualize these 81 combinations as paths and, using the transition and emission probabilities, mark each vertex and edge as shown below.
HMM-POS Tagging
The next step is to delete all the vertices and edges with probability zero; the vertices which do not lead to the endpoint are also removed.
Now there are only two paths that lead to the end. Let us calculate the probability associated with each path:
<S>→N→M→N→N→<E> = 3/4 * 1/9 * 3/9 * 1/4 * 1/4 * 2/9 * 1/9 * 4/9 * 4/9 = 0.00000846754
<S>→N→M→V→N→<E> = 3/4 * 1/9 * 3/9 * 1/4 * 3/4 * 1/4 * 1 * 4/9 * 4/9 = 0.00025720164
Clearly, the probability of the second sequence is much higher, and hence the HMM is going to tag each word in the sentence according to this sequence.
Lecture : 11
Module 3: Syntax Parsing
Topic: Introduction
Optimizing HMM with Viterbi Algorithm
The Viterbi algorithm is a dynamic programming algorithm for finding the most likely sequence of hidden states (called the Viterbi path) that results in a sequence of observed events, especially in the context of Markov information sources and hidden Markov models (HMMs).
In the previous section, we optimized the HMM and brought our calculations down from 81 paths to just two. Now we will optimize the HMM further by using the Viterbi algorithm. Let us use the same example as before and apply the Viterbi algorithm to it.
Optimizing HMM with Viterbi Algorithm
Consider the encircled vertex in the example. There are two mini-paths leading to this vertex, each with its own probability. We discard the mini-path having the lower probability and keep the other. The same procedure is done for all the states in the graph.
Optimizing HMM with Viterbi Algorithm
As we can see in the figure, the probabilities of all paths leading to a node are calculated, and we remove the edges or paths which have the lower probability cost. Also, you may notice some nodes having a probability of zero; such nodes have no edges attached to them, as all the paths to them have zero probability. The graph obtained after computing the probabilities of all paths leading to a node is shown below.
Optimizing HMM with Viterbi Algorithm
To get the optimal path, we start from the end and trace backward; since each state now has only one incoming edge, this gives us a single path.
As you may have noticed, this algorithm returns only one path, as compared to the previous method, which suggested two paths. Thus, by using this algorithm, we save a lot of computation.
After applying the Viterbi algorithm, the model tags the sentence as follows:
Will as a noun
Can as a modal
Spot as a verb
Mary as a noun
These are the right tags, so we conclude that the model can successfully tag the words with their appropriate POS tags. A compact implementation is sketched below.
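A compact Viterbi sketch for this toy HMM (my own code; the probability tables are the ones computed above, written as exact fractions, and unlisted word/tag pairs default to probability zero):

from fractions import Fraction as F

tags = ["N", "M", "V"]
# Transition probabilities P(next | prev), from the table above.
A = {
    "<S>": {"N": F(3,4), "M": F(1,4), "V": F(0)},
    "N":   {"N": F(1,9), "M": F(3,9), "V": F(1,9), "<E>": F(4,9)},
    "M":   {"N": F(1,4), "M": F(0),   "V": F(3,4), "<E>": F(0)},
    "V":   {"N": F(4,4), "M": F(0),   "V": F(0),   "<E>": F(0)},
}
# Emission probabilities P(word | tag), from the table above.
B = {
    "N": {"mary": F(4,9), "jane": F(2,9), "will": F(1,9), "spot": F(2,9)},
    "M": {"will": F(3,4), "can": F(1,4)},
    "V": {"spot": F(1,4), "see": F(2,4), "pat": F(1,4)},
}

def viterbi(words):
    # best[t] = (probability, tag path) of the best path ending in tag t
    best = {t: (A["<S>"].get(t, F(0)) * B[t].get(words[0], F(0)), [t])
            for t in tags}
    for w in words[1:]:
        best = {
            t: max(
                ((p * A[prev].get(t, F(0)) * B[t].get(w, F(0)), path + [t])
                 for prev, (p, path) in best.items()),
                key=lambda x: x[0],
            )
            for t in tags
        }
    # Fold in the transition to the end-of-sentence tag <E>.
    prob, path = max(
        ((p * A[t].get("<E>", F(0)), path) for t, (p, path) in best.items()),
        key=lambda x: x[0],
    )
    return path, float(prob)

print(viterbi(["will", "can", "spot", "mary"]))
# Expected: (['N', 'M', 'V', 'N'], ~0.00025720164)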
Bi-gram statistical tagger
Transformation Based (Brill) Tagging
A hybrid approach:
– like rule-based taggers, this tagging is based on rules;
– like (most) stochastic taggers, the rules are automatically induced from hand-tagged data.
Basic idea: do a quick-and-dirty job first, and then use learned rules to patch things up. This overcomes the pure rule-based approach's problems of being too expensive, too slow, too tedious, etc.
It is an instance of Transformation-Based Learning: combine rules and statistics. Start with a dumb statistical system and patch up the typical mistakes it makes.
How dumb? Assign the most frequent tag (unigram) to each word in the input.
Process
1. Choose a Baseline Tagger:
To start, you need a baseline POS tagger that assigns initial tags to words in a sentence. Common
baseline taggers include Hidden Markov Models (HMMs) or rule-based taggers.
2. Collect Training Data:
You need labeled training data, which consists of sentences with the correct POS tags for each word. This
data is used to learn transformation rules.
3. Initialize Tag Assignments:
Apply the baseline tagger to a sentence and assign initial POS tags to each word.
4. Generate Transformation Rules:
The core of the Brill tagging process involves learning transformation rules from the training data. These
rules are typically in the form of "if-then" statements that specify how to modify or correct POS tags. Rules
are learned based on observed tagging errors in the training data.
Example transformation rule: "If a noun is followed by 'to,' change the tag of 'to' to 'TO'."
Process
5. Apply Transformation Rules:
Iterate through the sentence and apply transformation rules to modify the POS tags generated by the
baseline tagger.
6. Evaluate the Updated Tags:
After applying a set of transformation rules to a sentence, evaluate the updated POS tags. If the tagging
accuracy improves, keep the updated tags; otherwise, revert to the previous tagging.
7. Repeat:
Continue applying transformation rules and evaluating the tagging accuracy until a stopping criterion is
met, such as reaching a maximum number of iterations or achieving a desired level of accuracy.
8. Finalize Tags:
Once the iterative process is complete, the final POS tags are used as the output for the sentence.
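A toy sketch of this loop (my own illustration; the unigram baseline dictionary and the single patch rule are invented for the example rather than learned from data):

# Baseline: most-frequent tag per word (unigram), as in the slides.
most_frequent = {"can": "MD", "book": "NN", "that": "DT", "flight": "NN",
                 "they": "PRP"}

def baseline_tag(words):
    return [most_frequent.get(w, "NN") for w in words]

def apply_patch(tagged):
    # One hand-written transformation: if a word tagged NN follows
    # a personal pronoun (PRP), retag it as a verb (VB).
    out = list(tagged)
    for i in range(1, len(out)):
        word, tag = out[i]
        if tag == "NN" and out[i - 1][1] == "PRP":
            out[i] = (word, "VB")
    return out

words = ["they", "book", "that", "flight"]
tagged = list(zip(words, baseline_tag(words)))
print(tagged)               # baseline: 'book' wrongly tagged NN
print(apply_patch(tagged))  # patched:  'book' retagged VB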
Syntax
By syntax, we mean various aspects of how words are strung together to form components of sentences and how those components are strung together to form sentences. The word syntax comes from the Greek sýntaxis, meaning "setting out together" or "arrangement".
• that and after year last (not a well-formed string of English)
• I saw you yesterday (well-formed and meaningful)
• colorless green ideas sleep furiously (well-formed but nonsensical)
Syntax is the kind of implicit knowledge of your native language that you had mastered by the time you were 3 or 4 years old without explicit instruction, not necessarily the type of rules you were later taught in school.
Why should you care?
– Grammar checkers
– Question answering
– Information extraction
– Machine translation
Constituency
The idea: groups of words may behave as a single unit or phrase, called a constituent, e.g., the noun phrase:
– Kermit the frog
– they
– December twenty-sixth
– the reason he is running for president
Sentences have parts, some of which appear to have subparts. These groupings of words that go together we will call constituents.
These units form coherent classes that behave in similar ways; for example, we can say that noun phrases can come before verbs.
Constituent Phrases
For constituents, we usually name them as phrases based on the word that heads the constituent:
– the man from Amherst is a Noun Phrase (NP) because the head man is a noun;
– extremely clever is an Adjective Phrase (AP) because the head clever is an adjective;
– down the river is a Prepositional Phrase (PP) because the head down is a preposition;
– killed the rabbit is a Verb Phrase (VP) because the head killed is a verb.
Note that a word is a constituent (a little one), and sometimes words also act as phrases. In
Joe grew potatoes.
Joe and potatoes are both nouns and noun phrases.
Evidence constituency exists
1. Constituents appear in similar environments (e.g., before a verb):
Kermit the frog comes on stage
They come to Massachusetts every summer
December twenty-sixth comes after Christmas
The reason he is running for president comes out only now
But not each individual word in the constituent:
*The comes out... *is comes out... *for comes out...
2. The constituent can be placed in a number of different locations. Constituent = prepositional phrase on December twenty-sixth:
On December twenty-sixth I'd like to fly to Florida.
I'd like to fly on December twenty-sixth to Florida.
I'd like to fly to Florida on December twenty-sixth.
But not split apart:
*On December I'd like to fly twenty-sixth to Florida.
*On I'd like to fly December twenty-sixth to Florida.
Lecture : 12
Module 3: CFG
Topic: Introduction
Context-free grammar
The most common way of modeling constituency:
CFG = Context-Free Grammar = Phrase-Structure Grammar = BNF = Backus-Naur Form.
The idea of basing a grammar on constituent structure dates back to Wilhelm Wundt (1890), but it was not formalized until Chomsky (1956) and, independently, Backus (1959).
A CFG consists of:
– Terminals: we'll take these to be words.
– Non-terminals: the constituents in a language, like noun phrase, verb phrase, and sentence.
– Rules: equations that consist of a single non-terminal on the left and any number of terminals and non-terminals on the right.
CFG as a 4-Tuple
G = (T, N, S, R)
– T is a set of terminals (the lexicon).
– N is a set of non-terminals.
– S is the start symbol (one of the non-terminals).
– R is a set of rules/productions of the form X → γ, where X is a non-terminal and γ is a sequence of terminals and non-terminals (which may be empty).
A grammar G generates a language L.
CFG
G = (T, N, S, R)
T = {that, this, a, the, man, book, flight, meal, include, read, does}
N = {S, NP, NOM, VP, Det, Noun, Verb, Aux}
S = S
R = {
S → NP VP          Det → that | this | a | the
S → Aux NP VP      Noun → book | flight | meal | man
S → VP             Verb → book | include | read
NP → Det NOM       Aux → does
NOM → Noun
NOM → Noun NOM
VP → Verb
VP → Verb NP
}
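This grammar can be written down directly in NLTK and used to parse the derivation example that follows (usage sketch; CFG.fromstring and ChartParser are standard NLTK APIs):

import nltk

grammar = nltk.CFG.fromstring("""
S -> NP VP | Aux NP VP | VP
NP -> Det NOM
NOM -> Noun | Noun NOM
VP -> Verb | Verb NP
Det -> 'that' | 'this' | 'a' | 'the'
Noun -> 'book' | 'flight' | 'meal' | 'man'
Verb -> 'book' | 'include' | 'read'
Aux -> 'does'
""")

parser = nltk.ChartParser(grammar)
for tree in parser.parse("the man read this book".split()):
    print(tree)
# (S (NP (Det the) (NOM (Noun man)))
#    (VP (Verb read) (NP (Det this) (NOM (Noun book)))))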
CFG Example
A derivation of "The man read this book":
S → NP VP
→ Det NOM VP
→ The NOM VP
→ The Noun VP
→ The man VP
→ The man Verb NP
→ The man read NP
→ The man read Det NOM
→ The man read this NOM
→ The man read this Noun
→ The man read this book
Parse tree
Parse trees for "The man read this book" and "I prefer a morning flight"
CFGs can capture recursion
Example of the seemingly endless recursion of embedded prepositional phrases:
PP → Prep NP
NP → Noun PP
[S The mailman ate his [NP lunch [PP with his friend [PP from the cleaning staff [PP of the building [PP at the intersection [PP on the north end [PP of town]]]]]]]].
Grammaticality
A CFG defines a formal language = the set of all sentences (strings of words) that can be derived by the grammar.
Sentences in this set are said to be grammatical.
Sentences outside this set are said to be ungrammatical.
Parsing
Parsing is the process of taking a string and a grammar and returning one (or multiple) parse tree(s) for that string.
It is analogous to running a finite-state transducer with a tape; it's just more powerful: there are languages we can capture with CFGs that we can't capture with finite-state machines.
A recognizer is a program which, given a grammar and a sentence, returns YES if the sentence is accepted by the grammar (i.e., the sentence is in the language) and NO otherwise.
A parser, in addition to doing the work of a recognizer, also returns the set of parse trees for the string.
Top-down parsing and bottom-up parsing are two ways of searching for a parse tree; the most basic difference between them is that top-down parsing starts from the top (root) of the parse tree, while bottom-up parsing starts from the lowest level (the words).
Top-down parsing
Top-down parsing is goal-directed.
A top-down parser starts with a list of constituents to be built. It rewrites the goals in the goal list by matching one against the LHS of the grammar rules and expanding it with the RHS, attempting to match the sentence to be derived.
If a goal can be rewritten in several ways, then there is a choice of which rule to apply (a search problem).
One can use depth-first or breadth-first search, and goal ordering. A small demo follows.
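NLTK's RecursiveDescentParser is a textbook top-down, depth-first parser. A usage sketch, reusing the grammar object built in the CFG example above:

import nltk
# `grammar` is the nltk.CFG object from the earlier sketch.

rd_parser = nltk.RecursiveDescentParser(grammar)
for tree in rd_parser.parse("the man read this book".split()):
    print(tree)
# Note: a recursive-descent parser loops forever on left-recursive
# rules such as NP -> NP PP, which is one of the problems with
# top-down parsing discussed below.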
Top-down parsing example (Breadth-first)
Problems with top-down parsing
Left recursive rules... e.g. NP → NP PP... lead to infinite recursion
Will do badly if there are many different rules for the same LHS. Consider if there are 600 rules
for S, 599 of which start with NP, but one of which starts with a V, and the sentence starts with
a V.
Useless work: expands things that are possible top-down but not there (no bottom-up evidence
for them).
Top-down parsers do well if there is useful grammar-driven control: search is directed by the 23
grammar.
Top-down is hopeless for rewriting parts of speech (pre-terminals) with words (terminals). In
practice that is always done bottom-up as lexical lookup.
Repeated work: anywhere there is common substructure
Bottom-up parsing
Bottom-up parsing is data-directed.
The initial goal list of a bottom-up parser is the string to be parsed.
If a sequence in the goal list matches the RHS of a rule, then this sequence may be replaced by the
LHS of the rule.
Parsing is finished when the goal list contains just the start symbol. If the RHS of several rules match
the goal list, then there is a choice of which rule to apply (search problem)
Can use depth-first or breadth-first search, and goal ordering.
The standard presentation is as shift-reduce parsing
Bottom-up parsing example
[Figure: bottom-up parsing example]
Shift-reduce parsing
[Figure: shift-reduce parsing example]
Shift-reduce parsing: the algorithm
Start with the sentence to be parsed in an input buffer.
• a ”shift” action corresponds to pushing the next input symbol from the buffer onto the stack
• a "reduce" action occurs when we have a rule's RHS on top of the stack; to perform the reduction, we pop the rule's RHS off the stack and replace it with the non-terminal on the LHS of the corresponding rule.
(When either "shift" or "reduce" is possible, choose one arbitrarily.)
If you end up with only the start symbol on the stack, then success!
If you don't, and no further "shift" or "reduce" actions are possible, backtrack.
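A minimal backtracking shift-reduce recognizer for the toy grammar from this module; an illustrative sketch, not an efficient parser. When both shift and reduce are possible it tries reductions first and backtracks on failure:

```python
# Grammar rules as (LHS, RHS) pairs; lexicon maps words to possible categories.
RULES = [
    ("S", ("NP", "VP")), ("S", ("Aux", "NP", "VP")), ("S", ("VP",)),
    ("NP", ("Det", "NOM")), ("NOM", ("Noun",)), ("NOM", ("Noun", "NOM")),
    ("VP", ("Verb",)), ("VP", ("Verb", "NP")),
]
LEXICON = {
    "that": {"Det"}, "this": {"Det"}, "a": {"Det"}, "the": {"Det"},
    "flight": {"Noun"}, "meal": {"Noun"}, "man": {"Noun"},
    "book": {"Noun", "Verb"}, "include": {"Verb"}, "read": {"Verb"},
    "does": {"Aux"},
}

def accepts(stack, buffer):
    if not buffer and stack == ["S"]:
        return True                     # success: input consumed, only S left
    for lhs, rhs in RULES:              # try every reduce whose RHS tops the stack
        n = len(rhs)
        if tuple(stack[-n:]) == rhs and accepts(stack[:-n] + [lhs], buffer):
            return True
    if buffer:                          # try shifting the next word's categories
        for tag in LEXICON.get(buffer[0], ()):
            if accepts(stack + [tag], buffer[1:]):
                return True
    return False                        # dead end: backtrack

print(accepts([], "the man read this book".split()))  # True
print(accepts([], "read the this".split()))           # False
```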
Contd.
In a top-down parser, the main decision was which production rule to pick.
In a bottom-up shift-reduce parser there are two decisions:
1. Should we shift another symbol, or reduce by some rule?
2. If reduce, then reduce by which rule?
both of which can lead to the need to backtrack
Problem:
• Unable to deal with empty categories: termination problem, unless rewriting empties as
constituents is somehow restricted (but then it’s generally incomplete)
• Useless work: locally possible, but globally impossible
• Inefficient when there is great lexical ambiguity (grammar-driven control might help here).
Conversely, it is data-directed: it attempts to parse the words that are there
• Repeated work: anywhere there is common substructure.
Noun Phrase
The noun phrase can be viewed as revolving around a head, the central noun in the noun
phrase. The syntax of English allows for both
• Prenominal prehead modifiers
• Post nominal (post head) modifiers
Prenominal prehead modifiers are words or phrases that appear before the noun and modify it.
These modifiers provide additional information about the noun. Here's an example:
The big, red apple: In this noun phrase, "big" and "red" are prenominal prehead modifiers that provide more details about the noun "apple."
Postnominal (post head) modifiers are words or phrases that appear after the noun and
modify it. These modifiers also offer additional information about the noun. Here's an example:
The car with the broken windshield: In this noun phrase, "with the broken windshield" is a
postnominal modifier that provides more information about the noun "car."
Noun Phrase
Noun phrases can begin with a determiner, as follows:
a stop, the flights, that fare, this flight, those flights, any flights, some flight
Word classes that appear in the NP before the determiner are called
predeterminers .
A number of different kinds of word classes can appear in the NP between the determiner and the head noun.
• Cardinal numbers Eg two friends, one stop
• Ordinal numbers include first, second, third etc but also words like next, last,
past, other, and another Eg the first one, the next day, the second leg, the last
flight, the other American flight, any other fares.
• Quantifiers many, few, several occur only with plural count nouns Eg many
fares
• The quantifiers much and a little occur only with noncount nouns
Noun Phrase
Noun phrases can start with determiners...
Determiners can be
Simple lexical items: the, this, a, an, etc.
A car 31
Or simple possessives
John’s car
Or complex recursive versions of that
John’s sister’s husband’s son’s car
Noun Phrase
Adjectives occur after quantifiers but before nouns.
A first class fare
A nonstop flight
The longest layover
The earliest lunch flight
Adjectives can also be grouped into a phrase called an adjective phrase AP.
APs can have an adverb before the adjective
Eg. the least expensive fare
All the options for prenominal modifiers are combined with one rule as follows:
NP → (Det) (Card) (Ord) (Quant) (AP) Nominal
Note the use of parentheses ( ) to mark optional constituents.
Noun Phrase
A head noun can be followed by postmodifiers. Three kinds:
Prepositional phrases
• Flights from Seattle
Non-finite clauses
• Flights arriving before noon
Relative clauses
• Flights that serve breakfast
More PP postmodifier examples:
• any stopovers [for Delta seven fifty one]
• all flights [from Cleveland] [to Newark]
• arrival [in San Jose] [before seven p.m]
• a reservation [on flight six oh six] [from Tampa] [to Montreal]
Here's a new rule to account for PP postmodifiers:
Nominal → Nominal PP
Noun Phrase
• The three most common kinds of non-finite postmodifiers are the gerundive (-ing), -ed, and infinitive forms
• Gerundive postmodifiers are so called because they consist of a verb phrase that begins with the gerundive (-ing) form of the verb
In the following examples, the verb phrases happen to all have only prepositional phrases after the verb.
• any of those (leaving on Thursday)
• any flights (arriving after eleven a.m)
• flights (arriving within thirty minutes of each other)
The use of a new nonterminal GerundVP:
Nominal → Nominal GerundVP
Noun Phrase
Rules for GerundVP constituents can be made by duplicating all of our VP productions, substituting GerundV for V:
• GerundVP → GerundV NP
• GerundVP → GerundV PP
• GerundVP → GerundV
• GerundVP → GerundV NP PP
GerundV can then be defined as:
GerundV → being | preferring | arriving | leaving | …
A postnominal relative clause (more correctly a restrictive relative clause), is a clause that often
begins with a relative pronoun (that and who are the most common)
Agreement
Constraints that hold among various constituents
For example, in English, determiners and the head nouns in NPs have to agree in their
number.
Which of the following cannot be parsed by the rule NP → Det Nominal?
(O) This flight (X) This flights
(O) Those flights (X) Those flight
This rule does not handle agreement! (The rule does not detect whether the agreement is correct or not.)
Problem
Our earlier NP rules are clearly deficient since they don't capture the agreement constraint
NP → Det Nominal
accepts, and assigns correct structures to, grammatical examples (this flight)
But it's also happy with incorrect examples (*these flight)
Such a rule is said to overgenerate
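One standard remedy is to parameterize categories with features that must unify. A minimal sketch with NLTK's feature grammars (assuming nltk is installed; the NUM feature and the tiny lexicon are illustrative):

```python
import nltk
from nltk.parse import FeatureChartParser

# Det, N and V must agree on the NUM feature, so *"this flights" is rejected.
fg = nltk.grammar.FeatureGrammar.fromstring("""
% start S
S -> NP[NUM=?n] V[NUM=?n]
NP[NUM=?n] -> Det[NUM=?n] N[NUM=?n]
Det[NUM=sg] -> 'this'
Det[NUM=pl] -> 'those'
N[NUM=sg] -> 'flight'
N[NUM=pl] -> 'flights'
V[NUM=sg] -> 'departs'
V[NUM=pl] -> 'depart'
""")
parser = FeatureChartParser(fg)
print(len(list(parser.parse("this flight departs".split()))))  # 1 parse
print(len(list(parser.parse("this flights depart".split()))))  # 0 parses
```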
THE VERB PHRASE AND SUBCATEGORIZATION
The verb phrase consists of the verb and a number of other constituents (arguments)
But, even though there are many valid VP rules in English, not all verbs are
allowed to participate in all those VP rules
We can subcategorize the verbs in a language according to the sets of VP rules
that they participate in
This is a modern take on the traditional notion of transitive/intransitive
Modern grammars may have 100s of such classes
Contd.
Sneeze: John sneezed
Find: Please find [a flight to NY] NP
Give: Give [me] NP [a cheaper fare] NP
Help: Can you help [me] NP [with a flight] PP
Prefer: I prefer [to leave earlier] TO VP
Told: I was told [United has a flight] S
• *John sneezed the book
• *I prefer United has a flight
• *Give with a flight
As with agreement phenomena, we need a way to formally express the constraints!
Contd.
The various rules for VPs overgenerate.
They permit the presence of strings containing verbs and arguments that don't go together
For example: VP → V NP
therefore "sneezed the book" is a VP, since "sneeze" is a verb and "the book" is a valid NP
Grammar Equivalence & Normal Form
A formal language is defined as a (possibly infinite) set of strings of words. Two kinds of grammar equivalence:
Weak equivalence
Strong equivalence
Two grammars are strongly equivalent if they generate the same set of strings and assign the same phrase structure to each sentence (allowing merely for renaming of the non-terminal symbols)
Two grammars are weakly equivalent if they generate the same set of strings but do not assign the same phrase structure to each sentence
It is sometimes useful to have a normal form for grammars, in which each of the productions takes a particular form
For example, a context-free grammar is in Chomsky normal form (CNF) if it is ε-free and each production is either of the form A → B C or A → a
Any grammar can be converted into a weakly equivalent Chomsky normal form grammar
For example, a rule of the form
A → B C D
can be converted into the following two CNF rules
A → B X
X → C D
Lecture : 13
Module 4:
Semantics: Computational Desiderata for Representations- Meaning Structure of Language- First Order Predicate Calculus- Elements of FOPC- The Semantics of FOPC- Syntax-Driven Semantic Analysis- Attachments for a Fragment of English.
Module 4: Semantics
Topic: Introduction
Semantic Analysis
Semantic analysis in natural language processing (NLP) refers to the process of understanding the
meaning of words, phrases, sentences, or even entire documents. It goes beyond syntactic analysis,
which focuses on the grammatical structure of language, to extract the underlying meaning and
context.
Here are some key aspects of semantic analysis in NLP:
Word Sense Disambiguation (WSD): Words often have multiple meanings depending on the context in which they are used. WSD is the task of determining the correct sense of a word in a given context. For example, the word "bank" could refer to a financial institution or the side of a river.
Named Entity Recognition (NER): NER involves identifying and classifying entities such as names of
people, organizations, locations, dates, and other specific terms in a text. This helps in
understanding the key entities and their relationships within a document.
Semantic Role Labeling (SRL): SRL aims to identify the roles of different components of a sentence,
such as the subject, object, and predicate. It helps in understanding the relationships between
entities and their actions in a given context.
Semantic Analysis
Coreference Resolution: This involves determining when two or more expressions in a text refer to the
same entity. For example, in the sentence "John went to the store. He bought some groceries," resolving
the pronoun "He" to refer to "John" requires coreference resolution.
Sentiment Analysis: While often associated more with the emotional aspect of language, sentiment
analysis also involves understanding the underlying meaning of text. It helps determine whether a piece
of text expresses a positive, negative, or neutral sentiment.
Semantic Similarity: This involves measuring the degree of similarity between words, phrases, or
sentences in terms of meaning. It is useful in tasks like information retrieval, document clustering, and
question answering.
Word Embeddings and Vector Representations: Techniques like word embeddings (e.g., Word2Vec,
GloVe, and BERT) represent words in a continuous vector space where semantically similar words are
closer in the vector space. This allows algorithms to capture semantic relationships between words.
Frame Semantics and Ontologies: Understanding the frames or scenarios in which words and phrases
are used can contribute to a deeper understanding of meaning.
Meaning Representation Language
In natural language processing (NLP), meaning representation languages are formal languages or
frameworks used to represent the meaning of linguistic expressions in a structured and interpretable
way. These representations are essential for tasks such as semantic analysis, machine translation,
question answering, and other applications where understanding the meaning of natural language is
crucial.
But unlike parse trees, these representations aren't primarily descriptions of the structure of the inputs.
Consider the following everyday language tasks that require some form of semantic processing
Answering an essay question on an exam
Deciding what to order at a restaurant by reading a menu
Learning to use a new piece of software by reading the manual
Realizing that you’ve been insulted
Following a recipe
Contd.
For example, some of the knowledge of the world needed to perform the above tasks includes:
Answering and grading essay questions requires background knowledge about
The topic of the question
The desired knowledge level of the students
How such questions are normally answered
Learning to use a piece of software by reading a manual
Giving advice about how to do the same
Requires deep knowledge about current computers
The specific software in question
Similar software applications
Knowledge about users in general
Computational Desiderata for
Representation
Computational desiderata refer to the desired properties or characteristics that representations should
possess to effectively capture and model the meaning of language.
To focus this discussion, we will consider in more detail the task of giving advice about restaurants to tourists. In this discussion, we will assume that we have a computer system that accepts spoken language queries from tourists and constructs appropriate responses by using a knowledge base of relevant domain knowledge.
Verifiability
Unambiguous Representations
Canonical Form
Inference and Variables
Expressiveness
Verifiability
Verifiability: The system’s ability to compare representations to facts in memory
The most straightforward way to implement this notion is to make it possible for a system to compare, or match, the representation of the meaning of an input against the representations in its knowledge base, its store of information about its world.
Does Maharani serve vegetarian food?
Serves(Maharani; Vegetarian Food)
Input matched against the knowledge base of facts about a set of restaurants
Matching the input proposition in its knowledge base, it can return an affirmative answer
Otherwise, it must either say No if its knowledge of local restaurants is complete, or say that it does
not know
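A toy sketch of this matching step (the predicate names and facts are hypothetical):

```python
# The knowledge base as a set of ground propositions.
kb = {("Serves", "Maharani", "VegetarianFood"),
      ("Serves", "AyCaramba", "MexicanFood")}

# "Does Maharani serve vegetarian food?" becomes a membership test.
query = ("Serves", "Maharani", "VegetarianFood")
print("Yes" if query in kb else "No / don't know")  # Yes
```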
Unambiguous Representations
The domain of semantics is subject to ambiguity
Single linguistic inputs can legitimately have different meaning representations assigned to them
based on the circumstances in which they occur.
The cat is on the mat
Ambiguity:
The phrase "on the mat" might have multiple interpretations, as it could refer to a physical location or 14
imply a scolding or disciplinary action.
Unambiguous representations are crucial for NLP tasks to enhance the accuracy and reliability of
natural language understanding systems.
Vagueness
A concept closely related to ambiguity is vagueness
Like ambiguity, vagueness can make it difficult to determine what to do with a particular input
based on its meaning representation
Vagueness, however, does not give rise to multiple representations
Consider the following request as an example
I want to eat Italian food
Use of the phrase Italian food may provide enough information for a restaurant advisor to provide reasonable recommendations
It is nevertheless quite vague as to what the user really wants to eat
A vague representation of the meaning of this phrase may be appropriate for some purposes,
while a more specific representation may be needed for other purposes
Canonical Form
The notion that single sentences can be assigned multiple meanings leads to the related phenomenon of
distinct inputs that should be assigned the same meaning representation
Does Maharani have vegetarian dishes?
Do they have vegetarian food at Maharani?
Are vegetarian dishes served at Maharani?
Does Maharani serve vegetarian fare?
All of these requests should be assigned the same canonical meaning representation, so that an answer computed for one serves for all.
Inference and Variables
Can vegetarians eat at Maharani?
The term inference to refer generically to a system’s ability to draw valid conclusions based on the
meaning representation of inputs and its store of background knowledge
It must be possible for the system to draw conclusions about the truth of propositions that are not explicitly represented in the knowledge base, but are nevertheless logically derivable from the propositions that are present
I’d like to find a restaurant where I can get vegetarian food.
In this example, the request does not make reference to any particular restaurant
The user is stating that they would like information about an unknown and unnamed entity that is a
restaurant that serves vegetarian food
Answering this request requires a more complex kind of matching that involves the use of variables
A representation containing such variables as follows
Serves(x; Vegetarian Food)
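Answering such a request amounts to collecting the bindings of x that make the formula true; a toy sketch over the same hypothetical knowledge base:

```python
kb = {("Serves", "Maharani", "VegetarianFood"),
      ("Serves", "AyCaramba", "MexicanFood")}

# Match Serves(x; VegetarianFood): every subject that satisfies the pattern.
bindings = [subj for (pred, subj, obj) in kb
            if pred == "Serves" and obj == "VegetarianFood"]
print(bindings)  # ['Maharani']
```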
Expressiveness
Expressiveness in meaning representation in NLP refers to the ability of a representation system to
capture the richness and diversity of meanings present in natural language. An expressive
representation should be able to convey nuanced relationships, distinctions, and semantic intricacies
inherent in human language. Here's an example to illustrate expressiveness
The conference room echoed with the enthusiastic applause of the audience.
This representation captures the expressiveness of the sentence by not only representing the basic actions and entities but also incorporating additional details about the manner of applause and the specific location of the event. It goes beyond a simple surface-level representation and delves into the nuanced aspects of the sentence's meaning.
Meaning Structure of Language
These include a variety of conventional:
• Form-meaning associations
• Word order regularities
• Tense systems
• Conjunctions and quantifiers
• A fundamental predicate-argument structure
A predicate is a statement about a subject that either is true or false. It expresses a property or a
relation. Predicates often use verbs to convey actions or states.
Examples:
The cat is on the mat.
Predicate: "is on the mat"
Subject: "The cat"
Predicates: Primarily Verbs , VPs , Sentences, sometimes Nouns and NPs
Arguments: Primarily Nouns, Nominals, NPs, PPs.
Meaning Structure of Language
Argument:
An argument is a value that is applied to a function or, in logic, a subject that satisfies a predicate. In
simpler terms, it is what the predicate is about.
Examples:
1.In "The cat is on the mat," "The cat" is the argument of the predicate "is on the mat."
2.In "She likes to read books," "She" is the argument of the predicate "likes to read books." 20
3.In "The sun sets in the west," "The sun" is the argument of the predicate "sets in the west."
Predicates: Primarily Verbs , VPs , Sentences, sometimes Nouns and NPs
Arguments: Primarily Nouns, Nominals, NPs, PPs.
Contd.
These examples can be classified as having one of the three syntactic argument frames
I want Italian food NP want NP
I want to spend less than five dollars NP want Inf VP
I want it to be close by here NP want NP Inf VP
These syntactic frames specify the number, position and syntactic category of the arguments that are
expected.
The frame for the variety of want that appears in Example 1 specifies the following facts
There are two arguments to this predicate.
Both arguments must be NPs.
The first argument is pre verbal and plays the role of the subject.
The second argument is post verbal and plays the role of the direct object.
Contd.
Semantic roles and Semantic restrictions on these roles
The notion of a semantic role can be understood by looking at the similarities among the arguments in
Examples 1 to 4.
The study of roles associated with specific verbs and across classes of verbs is usually referred to as
thematic role or case role
The notion of semantic restrictions arises directly from these semantic roles
Consider the following phrase from the BERP corpus
An Italian restaurant under fifteen dollars
In this example, the meaning representation associated with the preposition under can be seen as having something like the following structure
Under(Italian Restaurant ; $15)
Prepositions can be characterized as two argument predicates where the first argument is an object that
is being placed in some relation to the second argument
Contd.
Another non verb based predicate argument structure example
Make a reservation for this evening for a table for two persons at 8
The predicate argument structure is based on the concept underlying the noun reservation, rather
than make, the main verb in the phrase
This example gives rise to a four-argument predicate structure like the following
Reservation(Hearer; Today; 8PM; 2)
Any useful meaning representation language must be organized in a way that supports the specification of semantic predicate-argument structures
Variable arity predicate argument structures
The semantic labeling of arguments to predicates
The statement of semantic constraints on the fillers of argument roles
Lecture : 14
Module 4: Semantics
Review
Propositional Logic
The simplest, and most abstract logic we can study is called propositional logic.
• Definition: A proposition is a statement that can be either true or false; it must be one or
the other, and it cannot be both.
Examples of propositions:
The fan is on
2 + 3 = 5
Whereas the following are not propositions:
1 + 2
Where is John?
There are two types of Propositions:
Atomic Propositions
Compound propositions
Propositional Logic
Atomic Propositions:
Definition: An atomic proposition is one whose truth or falsity does not depend on the truth or
falsity of any other proposition
Example:
"The Sun is cold“
2+2 is 4
15
Compound Propositions:
Compound propositions are constructed by combining simpler or atomic propositions, using
parenthesis and logical connectives.
Example:
"It is raining today, and street is wet."
"Ankit is a doctor, and his clinic is in Mumbai."
Propositional Logic
Logical Connectives:
Implication: In propositional logic, we have a connective that combines two propositions into a new proposition called the conditional.
If it is raining, then the street is wet.
Let P= It is raining, and Q= Street is wet, so it is
represented as P → Q
Propositional Logic
Biconditional: A sentence such as P ⇔ Q is a biconditional sentence, written "p iff q". Example: I am breathing if and only if I am alive.
P = I am breathing, Q = I am alive; it can be represented as P ⇔ Q.
Definition: If p and q are arbitrary propositions, then the biconditional of p and q is written: p ⇔
q and will be true iff either:
1. p and q are both true; or
2. p and q are both false.
Propositional Logic
We can nest complex formulae as deeply as we want.
• We can use parentheses, i.e. "(" and ")", to disambiguate formulae.
• EXAMPLES. If p, q, r, s and t are atomic propositions, then all of the following are formulae:
p ∧ q ⇒ r
p ∧ (q ⇒ r)
(p ∧ (q ⇒ r)) ∨ s
((p ∧ (q ⇒ r)) ∨ s) ∧ t
EXAMPLE. Suppose we have a valuation υ such that:
υ(p) = F
υ(q) = T
υ(r) = F
Then the truth value of (p ∨ q) ⇒ r is evaluated by:
(F ∨ T) ⇒ F = T ⇒ F = F
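The same evaluation in a few lines of Python, transcribing the truth table of the conditional directly:

```python
# Evaluating (p ∨ q) ⇒ r under the valuation p=F, q=T, r=F.
def implies(a: bool, b: bool) -> bool:
    return (not a) or b  # the conditional is false only when a=T and b=F

p, q, r = False, True, False
print(implies(p or q, r))  # False, matching the hand evaluation above
```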
First Order Predicate Calculus
First-Order Predicate Calculus (FOPC) plays a crucial role in representing and reasoning
about linguistic structures and meanings. It serves as a foundation for semantic analysis
and knowledge representation in NLP systems. Let's delve into a detailed explanation with
an example relevant to NLP.
Allows us to break sentences into predicates, subjects and objects, while also allowing us to
use quantifiers like "all", "each", "some" etc.
Blackburn & Bos make a strong argument for using first-order logic as the meaning
representation.
Powerful, flexible, general.
First Order Predicate Calculus
FOL symbols
○ Constants: john, mary
○ Predicates & relations: man, walks, loves
○ Variables: x, y
○ Logical connectives: ∧ ∨ ¬ →
○ Quantifiers: ∀ ∃
○ Other punctuation: parens, commas
FOL formulae
○ Atomic formulae: loves(john, mary)
○ Connective applications: man(john) ∧ loves(john, mary)
○ Quantified formulae: ∃x (man(x))
Predicates categories
One place: Intransitive verbs, common nouns, adjectives
Dog(x), Happy (x)
Two Place: Transitive verbs, prepositions
Likes(x,y), In(x,y)
Three Place: Ditransitive verbs
Gives(x,y,z)
Quantifier
Quantifiers generate quantification and specify the number of specimen in the universe.
Quantifiers allow us to determine or identify the range and scope of the variable in a logical expression.
There are two types of quantifiers:
Universal quantifier: for all, everyone, everything.
Existential quantifier: for some, at least one.
1. Universal quantifiers
Universal quantifiers specify that the statement within the range is true for everything or every instance of a particular thing.
Universal quantifiers are denoted by a symbol (∀) that looks like an inverted A. In a universal quantifier, we
use →.
If x is a variable, then ∀x can read as:
For all x
For every x
For each x
Example
Every kid likes football ∀x kid(x) → likes(x, football)
Quantifier
2. Existential quantifiers
Existential quantifiers are used to express that the statement within their scope is true for at least one
instance of something.
∃, which looks like a reversed E, is used to represent them. With the existential quantifier we always use the AND (conjunction) connective.
If x is a variable, the existential quantifier will be ∃x, read as:
For some x
There exists an x
For at least one x
Example
Some people like Football. ∃x: people(x) ∧ likes Football(x)
Scope and Free & Bound Variables
∀x[Person(x)] ∧ Happy(x)
(Every x is a person) and x is happy; here the x in Happy(x) is free, i.e. outside the quantifier's scope
Everyone is a person and he is happy
∀x[Person(x) ∧ Happy(x)]
(Every x is a person and every x is happy)
Everyone is happy
Examples
1. Some boys hate football
∃x: boys(x) ∧ hate(x, Football)
2. Every person who buys a Policy is smart
∀x ∀y: (Person(x) ∧ Policy(y) ∧ buys(x, y)) → Smart(x)
3. No person buys an expensive Policy
∀x ∀y: (Person(x) ∧ Policy(y) ∧ expensive(y)) → ¬buys(x, y)
4. Mary loves everyone
∀x: (person(x) → loves(Mary, x))
5. Everyone loves everyone except himself
∀x ∀y: (x ≠ y → loves(x, y))
Scope Ambiguity
Every student loves some teacher
(Every student)x loves (some teacher)y
One way: (every student)x (some teacher)y : x loves y
∀x [student(x) → ∃y [teacher(y) ∧ loves(x, y)]]
Another way: (some teacher)y (every student)x : x loves y
∃y [teacher(y) ∧ ∀x [student(x) → loves(x, y)]]
Variables and Quantifiers
Consider the following example.
A restaurant that serves Mexican food near ICSI.
The following would be a reasonable representation of the meaning of such a phrase.
Restaurant(x) ∧ Serves(x; MexicanFood) ∧ Near(LocationOf(x); LocationOf(ICSI))
Contd.
For example, if AyCaramba is a Mexican restaurant near ICSI, then substituting AyCaramba for x results in the following logical formula
Restaurant(AyCaramba) ∧ Serves(AyCaramba; MexicanFood) ∧ Near(LocationOf(AyCaramba); LocationOf(ICSI))
Based on the semantics of the operator ∧, this sentence will be true if all three of its component atomic formulas are true
Syntax
I only have five dollars and I don't have a lot of time
Have(Speaker; FiveDollars) ∧ ¬Have(Speaker; LotOfTime)
The semantic representation for this example is built up in a straightforward way from the semantics of the individual clauses through the use of the ∧ and ¬ operators
Lecture : 15
Module 4: Semantics
Review
FOPL more examples
Theresa is the mother of John and Mary: mother(Theresa, John) ∧ mother(Theresa, Mary)
John likes oranges but he doesn't like apples: likes(John, oranges) ∧ ¬likes(John, apples)
Mary is studying pharmacy or medicine: studies(Mary, pharmacy) ∨ studies(Mary, medicine)
FOPL more examples
Everyone likes Venice: ∀x likes(x, Venice)
Horses are mammals which are animals: ∀x (horse(x) → mammal(x) ∧ animal(x))
All that John inherited was a book: ∃x (book(x) ∧ inherited(John, x) ∧ ∀y (inherited(John, y) → y = x))
John inherited all of the books: ∀x (book(x) → inherited(John, x))
FOPL more examples
Existential quantifier: ∃x p(x) is read as "there exists one x such that p(x)" or "there is at least one x such that p(x)"
There is at least one bird in the forest: ∃x (bird(x) ∧ in(x, forest))
John and Mary are siblings: siblings(John, Mary)
There is one person who likes salad: ∃x (person(x) ∧ likes(x, salad))
Everyone likes someone and no one likes everyone: ∀x ∃y likes(x, y) ∧ ¬∃x ∀y likes(x, y)
FOPL more examples
The negation connectives and the quantifiers have the highest priority. Then come the connectives of
conjunction and disjunction. After that, implication, and finally the biconditional has the lowest priority.
Similar (equivalent) formulae:
∀x ¬P ≡ ¬∃x P
Example:
Nobody likes John: ∀x ¬like(x, John) ≡ ¬∃x like(x, John)
¬∀x P ≡ ∃x ¬P
Example:
There is at least one person who does not like John: ¬∀x like(x, John) ≡ ∃x ¬like(x, John)
FOPL more examples
Similar (equivalent) formulae:
∀x P ≡ ¬∃x ¬P
Example:
Everyone likes John: ∀x like(x, John) ≡ ¬∃x ¬like(x, John)
∃x P ≡ ¬∀x ¬P
Example:
There is at least one person who likes John: ∃x like(x, John) ≡ ¬∀x ¬like(x, John)
Syntax Driven Semantic Analysis
• How meaning representations are created
• Syntax-driven semantic analysis is a computational approach to semantic analysis that uses static knowledge from the lexicon and the grammar.
• Based on the principle of compositionality: the key idea is that the meaning of a sentence can be composed from the meanings of its parts
• The meaning of a sentence is not based solely on the words that make it up
• It is based on the ordering, grouping, and relations among the words in the sentence
• This analysis is then passed as input to a semantic analyzer to produce a meaning representation
Syntax Driven Semantic Analysis
Franco likes Frasca.
[Figure: parse tree with semantic attachments for "Franco likes Frasca"]
Steps in semantic representation
1. Find the meaning representation corresponding to the verb (nominates)
- it is the verb whose meaning defines the meaning of the whole sentence
- The meaning representation of the verb acts as the template for the meaning representation of the whole sentence
- The NPs are arguments to the verb and are filled into the template based on their roles
2. Find meaning representations for the two NPs
3. Bind the meaning representations of the NPs to the variables in the meaning representation of the verb to get the meaning representation of the whole sentence
Parse tree to Meaning Representation
How is the mapping from parse tree to meaning representation done?
Augment the lexicon and grammar rules with semantic attachment – devise a mapping between
rules of the grammar and rules of semantic representation (rule to rule hypothesis)
An augmented rule can take the form
A → α1 … αn { f(α1.sem, …, αn.sem) }
The text appearing within braces specifies the meaning representation assigned to A as a function of the semantic attachments of A's constituents
Contd.
President nominates speaker
Noun → President {President}
Noun → speaker {speaker}
{President} and {speaker} are the meanings associated with the augmented rules
NP → Noun {Noun.sem}
Verb → nominates {∃e,x,y nomination(e) ∧ nominator(e,x) ∧ nominee(e,y)}
VP → Verb NP {Verb.sem(NP.sem)}
To combine NP.sem and Verb.sem, y has to be replaced with speaker, which is not specified in Verb.sem.
Need to revise the semantic attachment for the verb
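The intended composition can be mimicked with curried functions; an illustrative sketch in which string-building stands in for real logical forms:

```python
# Verb.sem is curried: the VP consumes the object NP first, then the S rule
# supplies the subject NP, mirroring VP -> Verb NP and S -> NP VP.
verb_sem = lambda y: lambda x: (
    f"exists e. nomination(e) & nominator(e,{x}) & nominee(e,{y})")

vp_sem = verb_sem("speaker")   # VP -> Verb NP  {Verb.sem(NP.sem)}
s_sem = vp_sem("President")    # S  -> NP VP    {VP.sem(NP.sem)}
print(s_sem)  # exists e. nomination(e) & nominator(e,President) & nominee(e,speaker)
```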
Compositionality
How do we know how to construct the VP?
love(?, mary) OR love(mary, ?)
How can we specify in which way the bits &
pieces combine?
The meaning of the sentence is constructed
from:
● the meaning of the words (i.e., the lexicon)
● paralleling the syntactic construction (i.e.,
the semantic rules)
Lecture : 16
Module 4: Semantics
Review
Lambda Calculus
Loves(?, Mary)
Add a new operator λ to bind free variables
λx.love(x, mary) : "loves Mary"
Gluing together formulae/terms with function application:
(λx.love(x, mary)) @ john
(λx.love(x, mary))(john)
Lambda Calculus
Lambda calculus is used to combine semantic representations systematically
Lambda calculus is an extension of FOPC
Three rules define how to build all syntactically valid lambda terms: a variable is a term; if M and N are terms, then the application (M N) is a term; if x is a variable and M is a term, then the abstraction λx.M is a term
E.g.: (λx.P(x))(Taj) ⇒ P(Taj)
Replaces the variable x with Taj and removes the λ
With λ-calculus, the VP semantics problem can be solved
Beta reduction
(λx.love(x, mary)) (john)
1. Strip off the λ prefix
(love(x, mary)) (john)
2. Remove the argument
love(x, mary)
3. Replace all occurrences of λ-bound variable by argument
love(john, mary)
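NLTK's logic package performs exactly this beta reduction; a minimal sketch, assuming nltk is installed:

```python
from nltk.sem.logic import Expression

# Apply λx.love(x, mary) to john, then beta-reduce.
e = Expression.fromstring(r'(\x.love(x, mary))(john)')
print(e.simplify())  # love(john,mary)
```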
Rules
Rule 1: If α is a terminal node, then [[α]] is specified in the lexicon
Rule 2: If α is a non-branching node and β is its daughter node, then [[α]] = [[β]]
Rule 3: If α is a branching node, {β, γ} is the set of its daughters, and [[β]] is a function whose domain contains [[γ]], then [[α]] = [[β]]([[γ]])
Lexical Entries:
(i) Proper Names: as is
(ii) Intransitive Verbs: [[dies]] = λx. x dies
(iii) Transitive Verbs: [[loves]] = λy λx. x loves y
Types
Types are an important differentiator for semanticists
e : individuals (proper nouns)
t : truth values (0, 1)
If σ and τ are types, then <σ, τ> is a type as well; it is called a function type
We can write it as f: e → t, or f: D_e → D_t
Types of different parts:
S = t
N = e
VP = {find subject, output a truth value} : <input, output> = <e, t>
V = {find object, find subject, output a truth value} : <e, <e, t>>
Semantic Construction with Lambdas
[Figures: worked lambda-based constructions for adjectives, prepositions, and negation]
Lecture : 17
Module 4: Semantics
Review
Attachments for a Fragment of English
Sentences include declaratives, imperatives, yes/no questions, and wh-questions. Let's start by considering the following examples:
Flight 487 serves lunch : conveys factual information to a hearer
Serve lunch : a request for an action
Does Flight 207 serve lunch? : a request for information
Which flights serve lunch? : a request for information
The meaning representations of these examples all contain propositions concerning the serving of lunch on flights
They differ with respect to the role that these propositions are intended to serve
Contd.
To capture these differences a set of operators is applied to FOPC sentences
Specifically, the following operators will be applied to the FOPC representations
DCL : declaratives
IMP : imperatives
YNQ : yes/no questions
WHQ : wh-questions
• The normal interpretation for a representation headed by the DCL operator would be as a factual
statement to be added to the current knowledge base.
• Imperative sentences begin with a verb phrase and lack an overt subject. Because of the missing subject,
the meaning representation for the main verb phrase will consist of a λ expression with an unbound λ
variable representing this missing subject
Contd.
Simply supply a subject to the λ-expression by applying a final λ-reduction to a dummy constant.
The IMP operator can then be applied to this representation as in the following semantic
attachment.
Imperatives can be viewed as a kind of speech act
Contd.
Yes/no questions consist of a sentence-initial auxiliary verb, followed by a subject noun phrase and then a verb phrase.
The following semantic attachment simply ignores the auxiliary and, with the exception of the YNQ operator, is the same as the corresponding declarative attachment.
Yes/no questions should be thought of as asking whether the propositional part of their meaning is true or false given the knowledge currently contained in the knowledge base.
Contd.
wh-subject-questions ask for specific information about the subject of the sentence rather than
the sentence as a whole.
The following attachment produces a representation that consists of the operator WHQ, the
variable corresponding to the subject of the sentence, and the body of the proposition.
Contd.
Such questions can be answered by returning a set of assignments for the subject variable that
make the resulting proposition true with respect to the current knowledge base.
Finally, consider the following wh non subject question.
How can I go from Minneapolis to Long Beach?
The question is not about the subject of the sentence but rather some other argument, or some aspect of the proposition as a whole.
In this case, the representation needs to provide an indication as to what the question is about.
The following attachment provides this information by providing the semantics of the auxiliary as an
argument to the WHQ operator.
Lecture : 17
Module 5:
Machine Translation And Applications: Basic Issues in Machine Translation- Statistical Translation- Word Alignment- Phrase-based Translation- Synchronous Grammars- Applications of Natural Language Processing: Spell Check- Summarization- Language Translation.
Module 5: MT
What is Machine Translation?
Automatic conversion of text/speech from one natural language to another
Be the change you want to see in the world
वह परिवर्तन बनो जो संसार में देखना चाहते हो
Use cases
Government
● Administrative requirements
● Education
● Security
Enterprise
● Product manuals
● Customer support
Social
● Travel (signboards, food)
● Entertainment (books, movies, videos)
Translation under the hood
● Cross-lingual search
● Cross-lingual summarization
● Building multilingual dictionaries
Any multilingual NLP system will involve some kind of machine translation at some level
History of MT
[Figure: timeline of MT history]
History of MT
Georges Artsrouni and Petr Troyanskii received the first-ever patents for MT-like tools in 1933. These
tools were quite rudimentary, especially in comparison to what we think of when we hear the term
“MT” today. They worked by comparing dictionaries in the source and target language
The first general-purpose electronic computers were not far off on the horizon; in the mid-1940s, developers like Warren Weaver began to theorize about ways they could use computers to automate the translation process.
Early RBMT systems include the Institute Textile de France’s TITUS and Canada’s METEO system,
among others. And while US-based research certainly slowed down after the ALPAC report, it didn’t
come to a complete stop — SYSTRAN, founded in 1968, utilized RBMT as well, working closely with
the US Air Force for Russian-English translation in the 1970s.
In the 1990s, researchers at IBM developed a renewed interest in MT technology, publishing
research on some of the first SMT systems in 1991. Unlike RBMT, SMT doesn’t require developers
to manually input the rules of each language — instead, SMT engines utilize a bilingual corpus of
text to identify patterns in the languages that could be converted into statistical data.
History of MT
And as electronic computers slowly became more of a household item, so too did MT systems.
SYSTRAN launched the first web-based MT tool in 1997, providing lay people — not just
researchers and language service providers — access to an MT tool. Nearly a decade later, in 2006,
Google launched Google Translate, which was powered by SMT from 2007 until 2016.
In 2003, researchers at the University of Montreal developed a language model based on neural
networks, but it wasn’t until 2014, with the development of the sequence-to-sequence (Seq2Seq)
model, that NMT became a formidable rival for SMT.
After that, NMT quickly became the state-of-the-art MT tool; Google Translate adopted it in 2016.
NMT engines use larger corpora than SMT and are more reliable when it comes to translating long
strings of text with complex sentence structures.
Although large language models (LLMs) perform a lot of other functions besides translation, some
thought leaders have presented tools like ChatGPT as the future of localization and, by extension,
MT.
Why should you study Machine Translation?
One of the most challenging problems in Natural Language Processing
Pushes the boundaries of NLP
Involves analysis as well as synthesis
Involves all layers of NLP: morphology, syntax, semantics, pragmatics,
discourse
Theory and techniques in MT are applicable to a wide range of other
problems like transliteration, speech recognition and synthesis
Why is Machine Translation interesting?
Language divergence: the great diversity among the languages of the world.
The central problem of MT is to bridge this language divergence.
Language Divergence
Word order: SOV (Hindi), SVO (English), VSO, OSV
E: Argentina won the last World Cup
H: अजें टीना ने पपछला पवश्व कप जीर्ा था
Free (Hindi) vs rigid (English) word order
पपछला पवश्व कप अजें टीना ने जीर्ा था (correct)
The last World Cup Argentina won (grammatically incorrect)
The last World Cup won Argentina (meaning changes)
Language Divergence (contd.)
Different ways of expressing the same concept
water : पानी, जल, नीर
Language registers
Formal: आप बैठिये Informal: तू बैठ
Standard: मुझे डोसा चाहिए Dakhini: मेरे को डोसा होना
Why is Machine Translation difficult?
● Ambiguity
○ Same word, multiple meanings: मंत्री (minister or chess piece)
○ Same meaning, multiple words: जल, पानी, नीर (water)
● Word Order
○ Underlying deeper syntactic structure
○ Phrase structure grammar?
○ Computationally intensive
● Morphological Richness
○ Identifying basic units of words
Approaches to build MT systems
[Figure: classification of approaches to building MT systems]
Rule-based MT
Rules are written by linguistic experts to analyze the source, generate an intermediate
representation, and generate the target sentence
Depending on the depth of analysis: interlingua or transfer-based MT
Vauquois Triangle
Translation approaches can be classified by the depth of linguistic analysis they perform
[Figure: the Vauquois triangle]
Problems with rule-based MT
Requires linguistic expertise to develop systems
Maintenance of the system is difficult
Difficult to handle ambiguity
Scaling to a large number of language pairs is not easy
Example-based MT
Translation by analogy ⇒ match parts of sentences to known translations and then combine
Input: He buys a book on international politics
1. Phrase fragment matching (data-driven):
he buys | a book | international politics
2. Translation of segments (data-driven):
वह खरीदता है | एक किताब | अंतर्राष्ट्रीय राजनीति
3. Recombination (human-crafted rules/templates):
वह अंतर्राष्ट्रीय राजनीति पर एक किताब खरीदता है
● Partly rule-based, partly data-driven.
● Good methods for matching and large corpora did not exist when proposed.
Lecture : 18
Topic: Statistical Machine Translation
Module 5: MT
Review
SMT
Parallel corpora are available in several language pairs.
Basic idea: use a parallel corpora as a training set of translation examples
Classic example: IBM work on French-English translation, using the Canadian Hansards (1.7 million sentences of 30 words or less in length).
Idea goes back to Warren Weaver (1949): suggested applying statistical and cryptanalytic
techniques to translation.
….one naturally wonders if the problem of translation could conceivably be treated as
a problem in cryptography. When I look at an article in Russian, I say: “This is really
written in English, but it has been coded in some strange symbols. I will now proceed
to decode”
(Warren Weaver, 1949, in a letter to Norbert Wiener)
The Noisy Channel Model
Goal: translation system from French to English
Have a model p(e|f) which estimates conditional probability of any English sentence e
given the French sentence f. Use the training corpus to set the parameters.
A Noisy Channel Model has two components:
p(e): the language model
p(f|e): the translation model
Giving:
p(e|f) = p(e, f) / p(f) = p(e) p(f|e) / Σ_e' p(e') p(f|e')
and
argmax_e p(e|f) = argmax_e p(e) p(f|e)
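A toy decoding decision under this model (all probabilities invented for illustration): the language model rewards fluent word order while the translation model scores fidelity.

```python
# Choose the English candidate e maximizing p(e) * p(f|e).
lm = {"the house": 0.20, "house the": 0.01}  # p(e), language model
tm = {"the house": 0.30, "house the": 0.30}  # p(f|e), translation model

best = max(lm, key=lambda e: lm[e] * tm[e])
print(best)  # 'the house': fluency breaks the tie left by the translation model
```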
NCM
[Figure: the noisy channel model of translation]
SMT
Let’s formalize the translation process
We will model translation using a probabilistic model. Why?
- We would like to have a measure of confidence for the translations we learn
- We would like to model uncertainty in translation
Model: a simplified and idealized understanding of a physical process
SMT
Why use this counter-intuitive way of explaining translation?
● Makes it easier to mathematically represent translation and learn probabilities
● Fidelity and Fluency can be modelled separately
SMT
We have already seen how to learn n-gram language models
Let’s see how to learn the translation model 𝑃(𝒇|𝒆)
To learn sentence translation probabilities,
we first need to learn word-level translation probabilities
That is the task of word alignment
Word Alignment
A common use of aligned texts is the derivation of bilingual dictionaries and terminology databases.
This is usually done in two steps. First the text alignment is extended to a word alignment (unless we are dealing with an approach in which word and text alignment are induced simultaneously).
Then some criterion such as frequency is used to select aligned pairs.
Given a parallel sentence pair, find word-level correspondences.
If we knew the alignments, we could compute P(f|e)
21
EEE1001-Dr. Anirban Bhowmick
Natural Language
Processing
CSA4006
Dr. Anirban Bhowmick
Assistant Professor
VIT Bhopal
Lecture : 19
Topic: Statistical Machine Translation
Syllabus
Module 5:
Machine Translation and Applications: Basic Issues in Machine Translation - Statistical Translation - Word Alignment - Phrase-based Translation - Synchronous Grammars - Applications of Natural Language Processing: Spell Check - Summarization - Language Translation.
IBM Model 1
IBM Model 1 is a statistical machine translation model that aims to align words between a source language and a target language. The model learns the probabilities of word alignments from observed parallel sentences in bilingual corpora. The primary goal is to understand how words in the source language correspond to words in the target language.
Alignments:
[Figure: example word alignments]
[Figures: IBM Model 1, continued]
Alignments in the IBM Models
In IBM Model 1 all alignments a are equally likely. For an English sentence e of length l and a French sentence f of length m:

$$p(a \mid e, m) = \frac{1}{(l+1)^m}$$

(the l + 1 accounts for alignment to a special NULL word). Next step: come up with an estimate for p(f | a, e, m). In Model 1, this is:

$$p(f \mid a, e, m) = \prod_{j=1}^{m} t(f_j \mid e_{a_j})$$

where t(f | e) is the word-level translation probability.
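The t(f|e) parameters are typically estimated with the EM algorithm. Below is a minimal sketch of IBM Model 1 EM training, assuming a tiny invented corpus and ignoring the NULL word for brevity; it is an illustration, not the full textbook procedure.

```python
from collections import defaultdict

# Toy parallel corpus (invented): (foreign, english) sentence pairs.
corpus = [
    (["das", "Haus"], ["the", "house"]),
    (["das", "Buch"], ["the", "book"]),
    (["ein", "Buch"], ["a", "book"]),
]

# Initialize t(f|e) uniformly over the foreign vocabulary.
f_vocab = {f for fs, _ in corpus for f in fs}
t = defaultdict(lambda: 1.0 / len(f_vocab))

for _ in range(10):  # EM iterations
    count = defaultdict(float)   # expected counts c(f, e)
    total = defaultdict(float)   # expected counts c(e)
    for fs, es in corpus:
        for f in fs:
            # E-step: in Model 1, P(f aligns to e) is proportional to t(f|e).
            z = sum(t[(f, e)] for e in es)
            for e in es:
                c = t[(f, e)] / z
                count[(f, e)] += c
                total[e] += c
    # M-step: re-normalize the translation probabilities.
    for (f, e), c in count.items():
        t[(f, e)] = c / total[e]

print(round(t[("Haus", "house")], 3))  # converges toward 1.0
```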
IBM Model 1: Example
[Figure: worked example]
Phrase-Based Translation
Phrase-based machine translation is an approach that translates smaller units of text, typically phrases or short sequences of words, rather than translating word by word. This allows more flexibility in capturing linguistic variation and improves overall translation quality.
• Word-based models translate words as atomic units
• Phrase-based models translate phrases as atomic units
The translation process:
• The foreign input is segmented into phrases
• Each phrase is translated into English
• The phrases are reordered
Phrase Translation Table
Main knowledge source: a table of phrase translations and their probabilities.
Example: phrase translations for natuerlich
[Table: English candidate translations of natuerlich with probabilities]
Phrase Translation Table
Phrase translations for den Vorschlag learned from the Europarl corpus:
[Table: English candidate translations of den Vorschlag with probabilities]
The learned entries show:
– lexical variation (proposal vs. suggestions)
– morphological variation (proposal vs. proposals)
– included function words (the, a, ...)
– noise (it)
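In code, a phrase table can be held as a simple mapping from foreign phrases to scored English options; a minimal sketch with invented probabilities:

```python
# Toy phrase table (entries and probabilities invented for illustration).
phrase_table = {
    "den Vorschlag": [("the proposal", 0.6), ("the proposals", 0.2),
                      ("the suggestions", 0.15), ("it", 0.05)],
}

def best_translation(foreign_phrase):
    """Return the highest-probability English phrase for a foreign phrase."""
    options = phrase_table.get(foreign_phrase)
    return max(options, key=lambda x: x[1]) if options else None

print(best_translation("den Vorschlag"))  # ('the proposal', 0.6)
```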
Linguistic Phrases?
The model is not limited to linguistic phrases (noun phrases, verb phrases, prepositional phrases, ...).
• Example of a non-linguistic phrase pair: spass am → fun with the
• The preceding noun often helps with the translation of the preposition
• Experiments show that restricting the model to linguistic phrases hurts quality
Probabilistic Model
[Figure: the phrase-based probabilistic model]
Distance-Based Reordering
[Figure: distance-based reordering example]
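For reference, in Koehn's standard formulation, which these slides appear to follow, the phrase-based model combines the phrase translation probabilities, a distance-based reordering cost, and the language model:

$$e_{\text{best}} = \arg\max_{e} \prod_{i=1}^{I} \phi(\bar{f}_i \mid \bar{e}_i)\; d(\mathrm{start}_i - \mathrm{end}_{i-1} - 1)\; p_{\mathrm{LM}}(e)$$

with an exponentially decaying reordering cost

$$d(x) = \alpha^{|x|}, \qquad 0 < \alpha < 1$$

so that phrases translated far out of their source order are penalized.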
Learning a Phrase Translation Table
• Three stages:
– word alignment: using the IBM models or another method
– extraction of phrase pairs
– scoring of phrase pairs
Learning a Phrase Translation Table
[Figure: extracting phrase pairs from a word alignment matrix]
All words of the phrase pair have to align to each other.
Scoring Phrase Translations
• Phrase pair extraction: collect all phrase pairs from the data
• Phrase pair scoring: assign probabilities to phrase translations
• Score by relative frequency:

$$\phi(\bar{f} \mid \bar{e}) = \frac{\mathrm{count}(\bar{e}, \bar{f})}{\sum_{\bar{f}'} \mathrm{count}(\bar{e}, \bar{f}')}$$
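A minimal sketch of relative-frequency scoring over extracted phrase pairs; the pairs below are invented for illustration.

```python
from collections import Counter

# Extracted phrase pairs (english_phrase, foreign_phrase), invented for illustration.
pairs = [
    ("the proposal", "den Vorschlag"),
    ("the proposal", "den Vorschlag"),
    ("the proposal", "der Vorschlag"),
]

pair_counts = Counter(pairs)            # count(e, f)
e_counts = Counter(e for e, _ in pairs) # count(e) = sum over f of count(e, f)

# phi(f|e) = count(e, f) / count(e)
phi = {(e, f): c / e_counts[e] for (e, f), c in pair_counts.items()}
print(round(phi[("the proposal", "den Vorschlag")], 3))  # 0.667
```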
Lecture : 20
Topic: Neural Machine Translation
Encoder-Decoder Model
Encoder: takes in the input sentence and produces a fixed-size context vector.
Decoder: takes the context vector and generates the output sentence in the target language.
[Figure: encoder-decoder architecture]
Contd.
The encoder
Layers of recurrent units where, at each time step, an input token is received, relevant information is collected, and a hidden state is produced. The details depend on the type of RNN; in our example, an LSTM, the unit combines the current hidden state and the input and returns an output (which is discarded) and a new hidden state.
The encoder vector
The encoder vector is the last hidden state of the encoder. It tries to capture as much of the useful input information as possible to help the decoder produce good results, and it is the only information from the input that the decoder receives.
The decoder
Layers of recurrent units, e.g., LSTMs, where each unit produces an output at time step t. The hidden state of the first unit is the encoder vector, and each subsequent unit accepts the hidden state from the previous unit. The output is passed through a softmax function to obtain a probability for every token in the output vocabulary.
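A minimal PyTorch sketch of this encoder-decoder structure (not from the slides; the vocabulary sizes, dimensions, and class name are illustrative):

```python
import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    """Minimal LSTM encoder-decoder; vocab sizes and dimensions are illustrative."""
    def __init__(self, src_vocab=1000, tgt_vocab=1000, emb=64, hidden=128):
        super().__init__()
        self.src_emb = nn.Embedding(src_vocab, emb)
        self.tgt_emb = nn.Embedding(tgt_vocab, emb)
        self.encoder = nn.LSTM(emb, hidden, batch_first=True)
        self.decoder = nn.LSTM(emb, hidden, batch_first=True)
        self.out = nn.Linear(hidden, tgt_vocab)  # softmax is applied in the loss

    def forward(self, src_ids, tgt_ids):
        # The per-step encoder outputs are discarded; only the final (h, c) state is kept.
        _, state = self.encoder(self.src_emb(src_ids))
        # The decoder is initialized with the encoder vector (the last hidden state).
        dec_out, _ = self.decoder(self.tgt_emb(tgt_ids), state)
        return self.out(dec_out)  # logits over the target vocabulary

model = Seq2Seq()
logits = model(torch.randint(0, 1000, (2, 7)), torch.randint(0, 1000, (2, 5)))
print(logits.shape)  # torch.Size([2, 5, 1000])
```

A bidirectional encoder, as discussed next, would use nn.LSTM(..., bidirectional=True) and project its doubled hidden state back down for the decoder.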
Problem and Solution
Why? Longer sentences expose the limitations of a single-directional encoder-decoder architecture. Because language consists of tokens and grammar, the problem with this model is that it does not fully capture the complexity of the grammar.
Specifically, when translating the nth word of the source sentence, the RNN considers only the first n words, but grammatically the meaning of a word depends on the sequence of words both before and after it in the sentence.
A solution: the bidirectional LSTM model. A bidirectional model allows us to feed in the context of both past and future words, creating a more accurate encoder output vector.
Bi-LSTM
[Figure: bidirectional LSTM encoder]
But then the challenge becomes: which word in the sequence do we need to focus on?
Attention Mechanism
Overview: the attention mechanism enhances the traditional encoder-decoder architecture by allowing the decoder to "pay attention" to different parts of the source sentence when generating each word in the target sequence.
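A minimal sketch of dot-product attention over the encoder states; this is one common formulation, and the shapes below are illustrative rather than taken from the slides.

```python
import torch
import torch.nn.functional as F

# Illustrative shapes: 7 encoder states and one decoder state of size 128.
encoder_states = torch.randn(7, 128)   # one vector per source token
decoder_state = torch.randn(128)       # current decoder hidden state

# Dot-product scores -> softmax weights -> weighted sum (the context vector).
scores = encoder_states @ decoder_state   # (7,) one score per source token
weights = F.softmax(scores, dim=0)        # attention distribution over source tokens
context = weights @ encoder_states        # (128,) context vector fed to the decoder
print(weights.sum().item())  # 1.0: the weights form a probability distribution
```

The context vector is recomputed at every decoding step, so the decoder can focus on different source words for each target word it generates.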