Language Processing
COMP3411/9814: Artificial Intelligence
Lecture Overview
• Introduction
• Formal languages and grammars
• Regular expressions
• Minimum edit distance and words
• Natural language modelling: N-gram models
Introduction
• NLP applications:
• Chatbots (customer service), personal assistants (Siri, Alexa), machine translation, social robotics (home).
• Central problem – Ambiguity:
• Ambiguity makes it difficult to interpret meaning.
• For instance, “The boy saw a girl with a telescope”.
Introduction
Reference resolution:
Jack lost his wallet in his car.
He looked for it for several hours.
Jack forgot his wallet.
Sam did too.
Jack forgot his wallet.
He looked for someone to borrow money from.
Sam did too.
I saw two bears.
Bill saw some too.
Introduction
Discourse Structure
E: So you have the engine assembly finished.
Now attach the rope to the top of the engine.
By the way, did you buy petrol today?
A: Yes. I got some when I bought the new lawnmower wheel.
I forgot to take my can with me, so I bought a new one.
E: Did it cost much?
A: No, and I could use another anyway.
E: OK. Have you got it attached yet?
Tracking focus isn’t enough
Introduction
Hierarchical Structure
SEG1
Jack and Sue went to buy a new lawnmower since their old one was stolen.
SEG2
Sue had seen the man who took it and she had chased him down the street,
but he’d driven away in a truck.
After looking in the store, they realised they couldn’t afford one.
SEG3
By the way, Jack lost his job last month so he’s been short of cash recently.
He has been looking for a new one, but so far hasn’t had any luck.
Anyway, they finally found a used one at a garage sale.
Lecture Overview
• Introduction
• Formal languages and grammars
• Regular expressions
• Minimum edit distance and words
• Natural language modelling: N-gram models
Grammars
• A grammar rule is a formal device for defining sets of sequences of symbols.
• Sequence may represent a statement in a programming language.
• Sequence may be a sentence in a natural language such as English.
• Formally, a grammar is a 4-tuple G = <V, Σ, R, S>, where:
• V is a finite set of non-terminal symbols.
• Σ is a finite set of terminal symbols.
• R is a finite set of production rules, a relation in V x (V U Σ)*.
• S ∈ V is the start symbol.
Notation for Rules
• A grammar specification consists of production rules, such as:
<S> ::= a b
<S> ::= a <S> b
• First rule says that whenever S appears in a string, it can be rewritten with the
sequence ab.
• Second rule says that S can be rewritten with a followed by S followed by b.
• S is a non-terminal symbol, a and b are terminal symbols.
• A grammar rule can generate a string, e.g.:
S ⇒ aSb ⇒ aaSbb ⇒ aaabbb
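As an illustrative sketch (not from the lecture), the two rules can be turned into a tiny Python generator; every string it produces has n a's followed by n b's:

import random

# Minimal sketch: repeatedly rewrite S using the two rules
#   S ::= a b      and      S ::= a S b
# Every derivation terminates in a string of the form a^n b^n.
def derive_S(depth=0):
    if depth > 3 or random.random() < 0.5:
        return "ab"                          # apply S ::= a b
    return "a" + derive_S(depth + 1) + "b"   # apply S ::= a S b

print(derive_S())   # e.g. "ab", "aabb" or "aaabbb"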
A Simple Subset of English
sentence --> noun_phrase, verb_phrase
noun_phrase --> determiner, noun
verb_phrase --> verb, noun_phrase
determiner --> [a]
determiner --> [the]
noun --> [cat]
noun --> [mouse]
verb --> [scares]
verb --> [hates]
Examples of derivations:
the cat scares the mouse
the mouse hates the cat
the mouse scares the mouse
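The same toy grammar can be written down and parsed with NLTK's CFG tools. This is an illustrative sketch only (the lecture gives the rules in DCG-style notation), assuming nltk is installed:

import nltk

# The toy grammar above, in NLTK's CFG notation.
grammar = nltk.CFG.fromstring("""
    S   -> NP VP
    NP  -> Det N
    VP  -> V NP
    Det -> 'a' | 'the'
    N   -> 'cat' | 'mouse'
    V   -> 'scares' | 'hates'
""")

# A chart parser derives (and prints) the parse tree for a sentence.
parser = nltk.ChartParser(grammar)
for tree in parser.parse("the cat scares the mouse".split()):
    tree.pretty_print()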
Parse Trees
• Leaves are labelled by the terminal symbols of the grammar.
• Internal nodes are labelled by non-terminals.
• The parent-child relation is specified by the rules of the grammar.
Example parse tree for "the cat scares the mouse":
sentence
├── noun_phrase
│   ├── determiner: the
│   └── noun: cat
└── verb_phrase
    ├── verb: scares
    └── noun_phrase
        ├── determiner: the
        └── noun: mouse
Typical (Small) Grammar
S → NP VP
NP → [Det] Adj∗ N [AP | PP | Rel Clause]∗
VP → V [NP] [NP] PP∗
AP → Adj PP
PP → P NP
Det → a | an | the | ...
N → John | Mary | park | telescope | ...
V → saw | likes | believes | ...
Adj → hot | hotter | ...
P → in | with | ...
Extra notation: ∗ is "0 or more"; [..] is "optional"
Leftmost Derivation Example
Using the grammar above:
S
⇒ NP VP
⇒ N VP
⇒ John VP
⇒ John V NP PP
⇒ John saw NP PP
⇒ John saw N PP
⇒ John saw Mary PP
⇒ John saw Mary P NP
⇒ John saw Mary with NP
⇒ John saw Mary with Det N
⇒ John saw Mary with a N
⇒ John saw Mary with a telescope
⇒ means "rewrites as"
Rightmost Derivation
Using the same grammar:
S
⇒ NP VP
⇒ NP V NP PP
⇒ NP V NP P NP
⇒ NP V NP P Det N
⇒ NP V NP P Det telescope
⇒ NP V NP P a telescope
⇒ ...
⇒ ...
⇒ ...
Chomsky’s Hierarchy
• Grammatical formalisms can be classified
by their generative capacity.
• Four classes of grammatical formalisms
that differ only in the form of the rewrite
rules.
• The classes can be arranged in a
hierarchy.
Chomsky’s Hierarchy
• Unrestricted grammars: both sides of the rewrite rules can have any number of terminal
and nonterminal symbols, as in the rule A B C → D E.
• Context-sensitive grammars: the right-hand side must contain at least as many symbols
as the left-hand side. The name “context-sensitive” comes from the fact that a rule such as
A X B → A Y B says that an X can be rewritten as a Y in the context of a preceding A and
a following B. Context-sensitive grammars can represent languages such as aⁿbⁿcⁿ.
• Context-free grammars: the left-hand side consists of a single non-terminal symbol.
Thus, each rule licenses rewriting the nonterminal as the right-hand side in any context.
Context-free grammars can represent aⁿbⁿ, but not aⁿbⁿcⁿ.
• Regular grammars: every rule has a single non-terminal on the left-hand side and a
terminal symbol optionally followed by a non-terminal on the right-hand side. They cannot
represent aⁿbⁿ. The closest they can come is representing a*b*, a sequence of any number
of a’s followed by any number of b’s.
Lecture Overview
• Introduction
• Formal languages and grammars
• Regular expressions
• Minimum edit distance
• Natural language modelling: N-gram models
Regular expressions
A formal language for specifying text strings
How can we search for any of these?
◦ woodchuck
◦ woodchucks
◦ Woodchuck
◦ Woodchucks
◦ Sophisticated sequences of regular expressions are often the first model for any text processing task.
Regular Expressions: Disjunctions
Letters inside square brackets []
Pattern          Matches
[wW]oodchuck     Woodchuck, woodchuck
[1234567890]     Any digit

Ranges [A-Z]
Pattern          Matches                  Example text
[A-Z]            An upper case letter     Drenched Blossoms
[a-z]            A lower case letter      my beans were impatient
[0-9]            A single digit           Chapter 1: Down the Rabbit Hole
Regular Expressions: Negation in Disjunction
Negations [^Ss]
◦ Caret means negation only when first in []
Pattern          Matches                    Example text
[^A-Z]           Not an upper case letter   Oyfn pripetchik
[^Ss]            Neither ‘S’ nor ‘s’        I have no exquisite reason”
[e^]             Either e or ^              Look here
a^b              The pattern a caret b      Look up a^b now
Regular Expressions: More Disjunction
Woodchuck is another name for groundhog!
The pipe | for disjunction
Pattern Matches
groundhog|woodchuck woodchuck
yours|mine yours
a|b|c = [abc]
[gG]roundhog|[Ww]oodchuck Woodchuck
Regular Expressions: ? *+.
Pattern          Matches                       Example
colou?r          Optional previous char        color colour
oo*h!            0 or more of previous char    oh! ooh! oooh! ooooh!
o+h!             1 or more of previous char    oh! ooh! oooh! ooooh!
baa+                                           baa baaa baaaa baaaaa
beg.n                                          begin begun begun beg3n
* and + are known as Kleene * and Kleene + (after Stephen C. Kleene)
Regular Expressions: Anchors ^ $
Pattern          Matches
^[A-Z]           Palo Alto
^[^A-Za-z]       1 “Hello”
\.$              The end.
.$               The end? The end!
Example
Find me all instances of the word “the” in a text.
the
Misses capitalized examples
[tT]he
Incorrectly returns other or theology
[^a-zA-Z][tT]he[^a-zA-Z]
Misses “the” when it begins or ends a line
(^|[^a-zA-Z])[tT]he([^a-zA-Z]|$)
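A quick way to check these refinements is Python's re module; the test sentence below is made up for illustration (note that once capture groups are added, re.findall returns group tuples rather than whole matches):

import re

text = "The other day the theology student said: look at the"

print(re.findall(r"the", text))                        # misses "The", hits "other" and "theology"
print(re.findall(r"[tT]he", text))                     # still hits "other" and "theology"
print(re.findall(r"[^a-zA-Z][tT]he[^a-zA-Z]", text))   # misses "The" at the start and "the" at the end
print(re.findall(r"(^|[^a-zA-Z])([tT]he)([^a-zA-Z]|$)", text))  # matches all three standalone the/The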
Errors
The process we just went through was based on fixing two kinds of
errors:
1. Matching strings that we should not have matched (there, then,
other)
False positives (Type I errors)
2. Not matching things that we should have matched (The)
False negatives (Type II errors)
Errors cont.
In NLP we are always dealing with these kinds of errors.
Reducing the error rate for an application often involves two
antagonistic efforts:
◦ Increasing accuracy or precision (minimizing false positives)
◦ Increasing coverage or recall (minimizing false negatives).
Substitutions
Substitution in Python and UNIX commands:
s/regexp1/pattern/
e.g.:
s/colour/color/
Capture Groups
• Say we want to put angles around all numbers:
the 35 boxes → the <35> boxes
• Use parentheses () to "capture" a pattern into a numbered register (1,
2, 3…)
• Use \1 to refer to the contents of the register
s/([0-9]+)/<\1>/
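In Python the same substitution can be written with re.sub; a minimal sketch of the rule above:

import re

# \1 in the replacement refers to whatever the first (...) group captured.
print(re.sub(r"([0-9]+)", r"<\1>", "the 35 boxes"))   # -> the <35> boxes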
Capture groups: multiple registers
/the (.*)er they (.*), the \1er we \2/
Matches
the faster they ran, the faster we ran
But not
the faster they ran, the faster we ate
But suppose we don't want to capture?
Parentheses have a double function: grouping terms, and capturing
Non-capturing groups: add a ?: after the first parenthesis:
/(?:some|a few) (people|cats) like some \1/
matches
◦ some cats like some cats
but not
◦ some cats like some some
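Both behaviours can be checked directly in Python; the patterns are the ones from the slides, wrapped in re.search calls for illustration:

import re

# \1 and \2 must repeat exactly what the first and second groups captured.
pattern = r"the (.*)er they (.*), the \1er we \2"
print(bool(re.search(pattern, "the faster they ran, the faster we ran")))  # True
print(bool(re.search(pattern, "the faster they ran, the faster we ate")))  # False

# (?:...) groups without capturing, so \1 here refers to (people|cats).
pattern = r"(?:some|a few) (people|cats) like some \1"
print(bool(re.search(pattern, "some cats like some cats")))   # True
print(bool(re.search(pattern, "some cats like some some")))   # False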
Simple Application: ELIZA
Early NLP system that imitated a Rogerian psychotherapist
◦ Joseph Weizenbaum, 1966.
Uses pattern matching to match, e.g.,:
◦ “I need X”
and translates them into, e.g.
◦ “What would it mean to you if you got X?”
Simple Application: ELIZA
Men are all alike.
IN WHAT WAY
They're always bugging us about something or other.
CAN YOU THINK OF A SPECIFIC EXAMPLE
Well, my boyfriend made me come here.
YOUR BOYFRIEND MADE YOU COME HERE
He says I'm depressed much of the time.
I AM SORRY TO HEAR YOU ARE DEPRESSED
How ELIZA works
s/.* I’M (depressed|sad) .*/I AM SORRY TO HEAR YOU ARE \1/
s/.* I AM (depressed|sad) .*/WHY DO YOU THINK YOU ARE \1/
s/.* all .*/IN WHAT WAY?/
s/.* always .*/CAN YOU THINK OF A SPECIFIC EXAMPLE?/
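A minimal Python sketch of the same idea; the rule set and fallback below are illustrative, not Weizenbaum's original script:

import re

# Ordered (pattern, response) rules; the first matching rule wins.
RULES = [
    (r".* i'm (depressed|sad) .*", r"I AM SORRY TO HEAR YOU ARE \1"),
    (r".* i am (depressed|sad) .*", r"WHY DO YOU THINK YOU ARE \1"),
    (r".* all .*", "IN WHAT WAY?"),
    (r".* always .*", "CAN YOU THINK OF A SPECIFIC EXAMPLE?"),
]

def eliza_respond(utterance):
    for pattern, response in RULES:
        if re.match(pattern, utterance, flags=re.IGNORECASE):
            return re.sub(pattern, response, utterance, flags=re.IGNORECASE)
    return "PLEASE GO ON"   # fallback when no rule matches

print(eliza_respond("Men are all alike."))                       # IN WHAT WAY?
print(eliza_respond("He says I'm depressed much of the time."))  # I AM SORRY TO HEAR YOU ARE depressed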
How many words in a sentence?
"I do uh main- mainly business data processing"
◦ Fragments, filled pauses
"Seuss’s cat in the hat is different from other cats!"
◦ Lemma: same stem, part of speech, rough word sense
◦ cat and cats = same lemma
◦ Wordform: the full inflected surface form
◦ cat and cats = different wordforms
How many words in a sentence?
they lay back on the San Francisco grass and looked at the stars and
their
Type: an element of the vocabulary.
Token : an instance of that type in running text.
How many?
◦ 15 tokens (or 14)
◦ 13 types (or 12) (or 11)
Corpora
Words don't appear out of nowhere!
A text is produced by
• a specific writer(s),
• at a specific time,
• in a specific variety,
• of a specific language,
• for a specific function.
Text Normalization
Every NLP task requires text normalization:
1. Tokenizing (segmenting) words
2. Normalizing word formats
3. Segmenting sentences
Space-based tokenization
A very simple way to tokenize
◦ For languages that use space characters between words
◦ e.g., writing systems based on Arabic, Cyrillic, Greek or Latin scripts
◦ Segment off a token between instances of spaces
Unix tools for space-based tokenization
◦ The "tr" command
◦ Given a text file, output the word tokens and their frequencies
◦ Remove all the numbers and punctuation.
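A rough Python equivalent of that Unix pipeline (tr to split on non-letters, sort | uniq -c to count); the file name is a placeholder:

import re
from collections import Counter

# Read a text file, split it into lower-cased alphabetic tokens, and print
# the most frequent word types with their counts.
with open("sample.txt") as f:        # placeholder file name
    text = f.read()

tokens = re.findall(r"[a-z]+", text.lower())
for word, count in Counter(tokens).most_common(10):
    print(count, word)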
Issues in Tokenization
Can't just blindly remove punctuation:
◦ m.p.h., Ph.D., AT&T, cap’n
◦ prices ($45.55)
◦ dates (01/02/06)
◦ URLs (http://www.stanford.edu)
◦ hashtags (#nlproc)
◦ email addresses
Clitic contraction: a word that doesn't stand on its own
◦ "are" in we're, French "je" in j'ai, "le" in l'honneur
When should multiword expressions (MWE) be words?
◦ New York, rock ’n’ roll
Tokenization in NLTK
Tokenization needs to be run before any other language processing. A
standard method is to use deterministic algorithms based on regular
expressions.
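For example, NLTK's regexp_tokenize applies exactly such a deterministic regular expression; the pattern below is illustrative, not the one used in the lecture:

import nltk

text = "That U.S.A. poster-print costs $12.40..."

# Illustrative tokenization pattern: abbreviations, prices, hyphenated words,
# ellipses, then single punctuation marks (order matters).
pattern = r"""(?x)            # verbose regular expression
      (?:[A-Z]\.)+            # abbreviations, e.g. U.S.A.
    | \$?\d+(?:\.\d+)?%?      # currency and percentages, e.g. $12.40, 82%
    | \w+(?:-\w+)*            # words, with optional internal hyphens
    | \.\.\.                  # ellipsis
    | [.,;"'?():_`-]          # single punctuation characters
"""

print(nltk.regexp_tokenize(text, pattern))
# ['That', 'U.S.A.', 'poster-print', 'costs', '$12.40', '...']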
Word Normalization
Putting words/tokens in a standard format
◦ U.S.A. or USA
◦ uhhuh or uh-huh
◦ Fed or fed
◦ am, is, be, are
Applications like information retrieval or speech recognition: reduce all
letters to lower case
◦ Since users tend to use lower case
◦ Possible exception: upper case in mid-sentence?
◦ e.g., General Motors
For sentiment analysis or machine translation
◦ Case is helpful (US versus us is important)
Lemmatization
Represent all words as their lemma, their shared root
= dictionary headword form:
◦ am, are, is → be
◦ car, cars, car's, cars' → car
◦ He is reading detective stories
→ He be read detective story
Lemmatization is done by Morphological Parsing
Morphemes:
◦ The small meaningful units that make up words
◦ Stems: The core meaning-bearing units
◦ Affixes: Parts that adhere to stems, often with grammatical functions
Morphological Parsers:
◦ Parse cats into two morphemes cat and s
Porter Stemmer:
◦ Based on a series of rewrite rules run in series. Some sample rules: ATIONAL → ATE (relational → relate), ING → ε if the stem contains a vowel (motoring → motor), SSES → SS (grasses → grass).
Stemming
Reduce terms to stems, chopping off affixes crudely
Original text:
This was not the map we found in Billy Bones’s chest, but an accurate copy,
complete in all things-names and heights and soundings-with the single
exception of the red crosses and the written notes.

Stemmed output:
Thi wa not the map we found in Billi Bone s chest but an accur copi complet
in all thing name and height and sound with the singl except of the red cross
and the written note .
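A minimal sketch using NLTK's implementation of the Porter stemmer; its output closely matches the stemmed text above, modulo punctuation handling:

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
sentence = "This was not the map we found in Billy Bones's chest"

# The stemmer works word by word, so split on whitespace first.
print(" ".join(stemmer.stem(word) for word in sentence.split()))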
Sentence Segmentation
!, ? mostly unambiguous but period “.” is very ambiguous
◦ Sentence boundary
◦ Abbreviations like Inc. or Dr.
◦ Numbers like .02% or 4.3
Common algorithm: Tokenize first: use rules or ML to classify a period as either
(a) part of the word or (b) a sentence-boundary.
◦ An abbreviation dictionary can help
Sentence segmentation can then often be done by rules based on this
tokenization.
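A crude rule-based sketch of that idea in Python; the abbreviation dictionary and splitting rule are illustrative only:

import re

ABBREVIATIONS = {"Dr.", "Mr.", "Mrs.", "Inc.", "etc."}   # tiny illustrative abbreviation dictionary

def split_sentences(text):
    """Split after . ! or ? followed by whitespace and a capital letter,
    unless the token ending in '.' is a known abbreviation."""
    sentences, start = [], 0
    for match in re.finditer(r"[.!?](?=\s+[A-Z])", text):
        token = text[start:match.end()].split()[-1]   # word containing the candidate boundary
        if token not in ABBREVIATIONS:
            sentences.append(text[start:match.end()].strip())
            start = match.end()
    sentences.append(text[start:].strip())
    return sentences

print(split_sentences("Dr. Smith paid $4.30 for petrol. He then drove home."))
# ['Dr. Smith paid $4.30 for petrol.', 'He then drove home.']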
Lecture Overview
• Introduction
• Formal languages and grammars
• Regular expressions
• Minimum edit distance
• Natural language modelling: N-gram models
How similar are two strings?
Spell correction
◦ The user typed “graffe”
Which is closest?
◦ graf
◦ graft
◦ grail
◦ giraffe
• Also for Machine Translation, Information Extraction, Speech Recognition
Edit Distance
The minimum edit distance between two strings is the minimum
number of editing operations:
◦ Insertion
◦ Deletion
◦ Substitution
Needed to transform one into the other
Minimum Edit Distance
Two strings and their alignment (here intention aligned with execution; * marks a gap, and d/s/i mark delete, substitute and insert operations):

I N T E * N T I O N
| | | | | | | | | |
* E X E C U T I O N
d s s   i s
Minimum Edit Distance
If each operation has cost of 1
◦ Distance between these is 5
If substitutions cost 2 (Levenshtein)
◦ Distance between them is 8
How to find the Min Edit Distance?
Searching for a path (sequence of edits) from the start string to the
final string:
◦ Initial state: the word we’re transforming
◦ Operators: insert, delete, substitute
◦ Goal state: the word we’re trying to get to
◦ Path cost: what we want to minimize: the number of edits
Minimum Edit as Search
But the space of all edit sequences is huge!
◦ We can’t afford to navigate naïvely
◦ Lots of distinct paths wind up at the same state.
◦ We don’t have to keep track of all of them
◦ Just the shortest path to each of those revisited states.
Defining Min Edit Distance
For two strings
◦ X of length n
◦ Y of length m
We define D(i,j)
◦ the edit distance between X[1..i] and Y[1..j]
◦ i.e., the first i characters of X and the first j characters of Y
◦ The edit distance between X and Y is thus D(n,m)
Dynamic Programming for Minimum Edit Distance
Dynamic programming: A tabular computation of D(n,m)
Solving problems by combining solutions to subproblems.
Bottom-up
◦ We compute D(i,j) for small i,j
◦ And compute larger D(i,j) based on previously computed smaller values
◦ i.e., compute D(i,j) for all i (0 ≤ i ≤ n) and j (0 ≤ j ≤ m)
Defining Min Edit Distance (Levenshtein)
Initialization:
  D(i,0) = i
  D(0,j) = j
Recurrence Relation:
  For each i = 1…n
    For each j = 1…m
      D(i,j) = min( D(i-1,j) + 1,                                        (deletion)
                    D(i,j-1) + 1,                                        (insertion)
                    D(i-1,j-1) + 2 if X(i) ≠ Y(j), + 0 if X(i) = Y(j) )  (substitution)
Termination:
  D(n,m) is the distance
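A direct Python implementation of this recurrence (substitution cost 2, as in the Levenshtein variant above); a minimal sketch rather than an optimised one:

def min_edit_distance(source, target):
    n, m = len(source), len(target)
    # D[i][j] = edit distance between source[:i] and target[:j]
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0] = i                       # deletions only
    for j in range(1, m + 1):
        D[0][j] = j                       # insertions only
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub_cost = 0 if source[i - 1] == target[j - 1] else 2
            D[i][j] = min(D[i - 1][j] + 1,               # deletion
                          D[i][j - 1] + 1,               # insertion
                          D[i - 1][j - 1] + sub_cost)    # substitution (or copy)
    return D[n][m]

print(min_edit_distance("intention", "execution"))   # 8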
The Edit Distance Table
(Initialization: intention down the left, execution along the bottom.)
N  9
O  8
I  7
T  6
N  5
E  4
T  3
N  2
I  1
#  0  1  2  3  4  5  6  7  8  9
   #  E  X  E  C  U  T  I  O  N
The Edit Distance Table
N  9  8  9 10 11 12 11 10  9  8
O  8  7  8  9 10 11 10  9  8  9
I  7  6  7  8  9 10  9  8  9 10
T  6  5  6  7  8  9  8  9 10 11
N  5  4  5  6  7  8  9 10 11 10
E  4  3  4  5  6  7  8  9 10  9
T  3  4  5  6  7  8  7  8  9  8
N  2  3  4  5  6  7  8  7  8  7
I  1  2  3  4  5  6  7  6  7  8
#  0  1  2  3  4  5  6  7  8  9
   #  E  X  E  C  U  T  I  O  N
(The minimum edit distance D(9,9) = 8 is in the top-right cell.)
Lecture Overview
• Introduction
• Formal languages and grammars
• Regular expressions
• Minimum edit distance
• Natural language modelling: N-gram models
Probabilistic Language Models
Goal: assign a probability to a sentence
◦ Machine Translation:
◦ P(high winds tonite) > P(large winds tonite)
◦ Spell Correction
◦ The office is about fifteen minuets from my house
Why?
◦ P(about fifteen minutes from) > P(about fifteen minuets from)
◦ Speech Recognition
◦ P(I saw a van) >> P(eyes awe of an)
Probabilistic Language Modeling
Goal: compute the probability of a sentence or sequence of words:
P(W) = P(w1,w2,w3,w4,w5…wn)
Related task: probability of an upcoming word:
P(w5|w1,w2,w3,w4)
A model that computes either of these:
P(W) or P(wn|w1,w2…wn-1) is called a language model.
A better name might be "the grammar", but "language model" or LM is standard.
How to compute P(W)
How to compute this joint probability:
◦ P(its, water, is, so, transparent, that)
Using conditional probabilities:
P(B|A) = P(A,B)/P(A). Rewriting: P(A,B) = P(A)P(B|A)
More variables:
P(A,B,C,D) = P(A)P(B|A)P(C|A,B)P(D|A,B,C)
The Chain Rule in General
P(x1,x2,x3,…,xn) = P(x1)P(x2|x1)P(x3|x1,x2)…P(xn|x1,…,xn-1)
The Chain Rule applied to compute joint probability
of words in sentence
P(“its water is so transparent”) =
P(its) × P(water|its) × P(is|its water)
× P(so|its water is) × P(transparent|its water is so)
How to estimate these probabilities
Could we just count and divide?
No! Too many possible sentences!
We’ll never see enough data for estimating these
Markov Assumption
Simplifying assumption (Andrei Markov): condition only on the previous word,
P(the | its water is so transparent that) ≈ P(the | that)
Or maybe on the previous two words:
P(the | its water is so transparent that) ≈ P(the | transparent that)
Markov Assumption
In other words, we approximate each component in the product:
P(wi | w1 w2 … wi-1) ≈ P(wi | wi-k … wi-1)
Simplest case: Unigram model
P(w1 w2 … wn) ≈ P(w1) P(w2) … P(wn)
Some automatically generated sentences from a unigram model
fifth, an, of, futures, the, an, incorporated, a, a,
the, inflation, most, dollars, quarter, in, is, mass
thrift, did, eighty, said, hard, 'm, july, bullish
that, or, limited, the
Bigram model
Condition on the previous word:
P(wi | w1 w2 … wi-1) ≈ P(wi | wi-1)
texaco, rose, one, in, this, issue, is, pursuing, growth, in,
a, boiler, house, said, mr., gurria, mexico, 's, motion,
control, proposal, without, permission, from, five, hundred,
fifty, five, yen
outside, new, car, parking, lot, of, the, agreement, reached
this, would, be, a, record, november
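A minimal sketch of a bigram model in Python, trained on a tiny toy corpus (the generated text above comes from much larger news data): counts are turned into maximum-likelihood estimates of P(word | previous word), which can also be sampled to generate sentences.

import random
from collections import Counter, defaultdict

# Toy corpus with sentence-boundary markers <s> and </s>.
corpus = [
    "<s> I am Sam </s>",
    "<s> Sam I am </s>",
    "<s> I do not like green eggs and ham </s>",
]

# Count bigrams: bigram_counts[prev][word] = C(prev, word)
bigram_counts = defaultdict(Counter)
for sentence in corpus:
    tokens = sentence.split()
    for prev, word in zip(tokens, tokens[1:]):
        bigram_counts[prev][word] += 1

def bigram_prob(prev, word):
    """Maximum-likelihood estimate P(word | prev) = C(prev, word) / C(prev)."""
    total = sum(bigram_counts[prev].values())
    return bigram_counts[prev][word] / total if total else 0.0

print(bigram_prob("<s>", "I"))   # 2/3
print(bigram_prob("I", "am"))    # 2/3

# Generate a sentence by sampling each word given the previous one.
word, generated = "<s>", []
while True:
    word = random.choice(list(bigram_counts[word].elements()))
    if word == "</s>":
        break
    generated.append(word)
print(" ".join(generated))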
N-gram models
We can extend to trigrams, 4-grams, 5-grams
In general this is an insufficient model of language
◦ because language has long-distance dependencies:
“The computer which I had just put into the machine room on the fifth
floor crashed.”
But we can often get away with N-gram models
Approximating Shakespeare
Key words
References
• Jurafsky, D. & Martin, J. H. Speech and Language Processing. Stanford, 2023. Chapters 2 and 3.
• Russell, S. J. & Norvig, P. Artificial Intelligence: A Modern Approach. Fourth Edition, Pearson Education, Hoboken, NJ, 2021. Chapters 22 and 23.
Feedback
• In case you want to provide anonymous
feedback on these lectures, please visit:
• https://forms.gle/KBkN744QuffuAZLF8
Thank you very much!