Language Processing
COMP3411/9814: Artificial Intelligence
Lecture Overview
• Introduction
• Formal languages and grammars
• Regular expressions
• Minimum edit distance and words
• Natural language modelling: N-gram models
Introduction
• NLP applications:
• Chatbots (customer service), personal assistants (Siri, Alexa), machine translation, social robotics (home).
• Central problem – Ambiguity:
• Ambiguity makes it difficult to interpret meaning.
• For instance, “The boy saw a girl with a telescope”.
Introduction
Reference resolution:
Jack lost his wallet in his car.
He looked for it for several hours.
Jack forgot his wallet.
Sam did too.
Jack forgot his wallet.
He looked for someone to borrow money from.
Sam did too.
I saw two bears.
Bill saw some too.
Introduction
Discourse Structure
E: So you have the engine assembly finished.
Now attach the rope to the top of the engine.
By the way, did you buy petrol today?
A: Yes. I got some when I bought the new lawnmower wheel.
I forgot to take my can with me, so I bought a new one.
E: Did it cost much?
A: No, and I could use another anyway.
E: OK. Have you got it attached yet?
Tracking focus isn’t enough
Introduction
Hierarchical Structure
SEG1
Jack and Sue went to buy a new lawnmower since their old one was stolen.
SEG2
Sue had seen the man who took it and she had chased him down the street,
but he’d driven away in a truck.
After looking in the store, they realised they couldn’t afford one.
SEG3
By the way, Jack lost his job last month so he’s been short of cash recently.
He has been looking for a new one, but so far hasn’t had any luck.
Anyway, they finally found a used one at a garage sale.
Lecture Overview
• Introduction
• Formal languages and grammars
• Regular expressions
• Minimum edit distance and words
• Natural language modelling: N-gram models
Grammars
• A grammar rule is a formal device for defining sets of sequences of symbols.
• Sequence may represent a statement in a programming language.
• Sequence may be a sentence in a natural language such as English.
• Formally, a grammar is a 4-tuple G = <V, Σ, R, S>, where:
• V is a finite set of non-terminal symbols.
• Σ is a finite set of terminal symbols.
• R is a finite set of production rules, a relation in V x (V U Σ)*.
• S ∈ V is the start symbol.
Notation for Rules
• A grammar specification consists of production rules, such as:
<S> ::= a b
<S> ::= a <S> b
• First rule says that whenever S appears in a string, it can be rewritten with the
sequence ab.
• Second rule says that S can be rewritten with a followed by S followed by b.
• S is a non-terminal symbol, a and b are terminal symbols.
• A grammar rule can generate a string, e.g.:
S ⇒ aSb ⇒ aaSbb ⇒ aaabbb
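As an illustrative sketch (not from the lecture), the two rules can be turned into a tiny Python generator; every string it produces has n a's followed by n b's:

import random

# Minimal sketch: repeatedly rewrite S using the two rules
#   S ::= a b      and      S ::= a S b
# Every derivation terminates in a string of the form a^n b^n.
def derive_S(depth=0):
    if depth > 3 or random.random() < 0.5:
        return "ab"                          # apply S ::= a b
    return "a" + derive_S(depth + 1) + "b"   # apply S ::= a S b

print(derive_S())   # e.g. "ab", "aabb" or "aaabbb"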
A Simple Subset of English
sentence --> noun_phrase, verb_phrase
noun_phrase --> determiner, noun
verb_phrase --> verb, noun_phrase
determiner --> [a]
determiner --> [the]
noun --> [cat]
noun --> [mouse]
verb --> [scares]
verb --> [hates]
Examples of derivations:
the cat scares the mouse
the mouse hates the cat
the mouse scares the mouse
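The same toy grammar can be written down and parsed with NLTK's CFG tools. This is an illustrative sketch only (the lecture gives the rules in DCG-style notation), assuming nltk is installed:

import nltk

# The toy grammar above, in NLTK's CFG notation.
grammar = nltk.CFG.fromstring("""
    S   -> NP VP
    NP  -> Det N
    VP  -> V NP
    Det -> 'a' | 'the'
    N   -> 'cat' | 'mouse'
    V   -> 'scares' | 'hates'
""")

# A chart parser derives (and prints) the parse tree for a sentence.
parser = nltk.ChartParser(grammar)
for tree in parser.parse("the cat scares the mouse".split()):
    tree.pretty_print()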
Parse Trees
• Leaves are labelled by the terminal symbols of the grammar.
• Internal nodes are labelled by non-terminals.
• The parent-child relation is specified by the rules of the grammar.
Example parse tree for "the cat scares the mouse":
sentence
├── noun_phrase
│   ├── determiner: the
│   └── noun: cat
└── verb_phrase
    ├── verb: scares
    └── noun_phrase
        ├── determiner: the
        └── noun: mouse
Typical (Small) Grammar
S → NP VP
NP → [Det] Adj∗ N [AP | PP | Rel Clause]∗
VP → V [NP] [NP] PP∗
AP → Adj PP
PP → P NP
Det → a | an | the | ...
N → John | Mary | park | telescope | ...
V → saw | likes | believes | ...
Adj → hot | hotter | ...
P → in | with | ...
Extra notation: ∗ is "0 or more"; [..] is "optional"
Leftmost Derivation Example
Using the grammar above:
S
⇒ NP VP
⇒ N VP
⇒ John VP
⇒ John V NP PP
⇒ John saw NP PP
⇒ John saw N PP
⇒ John saw Mary PP
⇒ John saw Mary P NP
⇒ John saw Mary with NP
⇒ John saw Mary with Det N
⇒ John saw Mary with a N
⇒ John saw Mary with a telescope
⇒ means "rewrites as"
Rightmost Derivation
Using the same grammar:
S
⇒ NP VP
⇒ NP V NP PP
⇒ NP V NP P NP
⇒ NP V NP P Det N
⇒ NP V NP P Det telescope
⇒ NP V NP P a telescope
⇒ ...
⇒ ...
⇒ ...
Chomsky’s Hierarchy
• Grammatical formalisms can be classified
by their generative capacity.
• Four classes of grammatical formalisms
that differ only in the form of the rewrite
rules.
• The classes can be arranged in a
hierarchy.
Chomsky’s Hierarchy
• Unrestricted grammars: both sides of the rewrite rules can have any number of terminal
and nonterminal symbols, as in the rule A B C → D E.
• Context-sensitive grammars: the right-hand side must contain at least as many symbols
as the left-hand side. The name “context-sensitive” comes from the fact that a rule such as
A X B → A Y B says that an X can be rewritten as a Y in the context of a preceding A and
a following B. Context-sensitive grammars can represent languages such as aⁿbⁿcⁿ.
• Context-free grammars: the left-hand side consists of a single non-terminal symbol.
Thus, each rule licenses rewriting the nonterminal as the right-hand side in any context.
Context-free grammars can represent aⁿbⁿ, but not aⁿbⁿcⁿ.
• Regular grammars: every rule has a single non-terminal on the left-hand side and a
terminal symbol optionally followed by a non-terminal on the right-hand side. They cannot
represent aⁿbⁿ. The closest they can come is representing a*b*, a sequence of any number
of a’s followed by any number of b’s.
Lecture Overview
• Introduction
• Formal languages and grammars
• Regular expressions
• Minimum edit distance
• Natural language modelling: N-gram models
Regular expressions
A formal language for specifying text strings
How can we search for any of these?
◦ woodchuck
◦ woodchucks
◦ Woodchuck
◦ Woodchucks
◦ Sophisticated sequences of regular expressions are often the first model for any text processing task.
Regular Expressions: Disjunctions
Letters inside square brackets []
Pattern          Matches
[wW]oodchuck     Woodchuck, woodchuck
[1234567890]     Any digit

Ranges [A-Z]
Pattern          Matches                  Example text
[A-Z]            An upper case letter     Drenched Blossoms
[a-z]            A lower case letter      my beans were impatient
[0-9]            A single digit           Chapter 1: Down the Rabbit Hole
Regular Expressions: Negation in Disjunction
Negations [^Ss]
◦ Caret means negation only when first in []
Pattern          Matches                    Example text
[^A-Z]           Not an upper case letter   Oyfn pripetchik
[^Ss]            Neither ‘S’ nor ‘s’        I have no exquisite reason”
[e^]             Either e or ^              Look here
a^b              The pattern a caret b      Look up a^b now
Regular Expressions: More Disjunction
Woodchuck is another name for groundhog!
The pipe | for disjunction
Pattern Matches
groundhog|woodchuck woodchuck
yours|mine yours
a|b|c = [abc]
[gG]roundhog|[Ww]oodchuck Woodchuck
Regular Expressions: ? *+.
Pattern          Matches                       Example
colou?r          Optional previous char        color colour
oo*h!            0 or more of previous char    oh! ooh! oooh! ooooh!
o+h!             1 or more of previous char    oh! ooh! oooh! ooooh!
baa+                                           baa baaa baaaa baaaaa
beg.n                                          begin begun begun beg3n
* and + are known as Kleene * and Kleene + (after Stephen C. Kleene)
Regular Expressions: Anchors ^ $
Pattern          Matches
^[A-Z]           Palo Alto
^[^A-Za-z]       1 “Hello”
\.$              The end.
.$               The end? The end!
Example
Find me all instances of the word “the” in a text.
the
Misses capitalized examples
[tT]he
Incorrectly returns other or theology
[^a-zA-Z][tT]he[^a-zA-Z]
Misses “the” when it begins or ends a line
(^|[^a-zA-Z])[tT]he([^a-zA-Z]|$)
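A quick way to check these refinements is Python's re module; the test sentence below is made up for illustration (note that once capture groups are added, re.findall returns group tuples rather than whole matches):

import re

text = "The other day the theology student said: look at the"

print(re.findall(r"the", text))                        # misses "The", hits "other" and "theology"
print(re.findall(r"[tT]he", text))                     # still hits "other" and "theology"
print(re.findall(r"[^a-zA-Z][tT]he[^a-zA-Z]", text))   # misses "The" at the start and "the" at the end
print(re.findall(r"(^|[^a-zA-Z])([tT]he)([^a-zA-Z]|$)", text))  # matches all three standalone the/The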
Errors
The process we just went through was based on fixing two kinds of
errors:
1. Matching strings that we should not have matched (there, then,
other)
False positives (Type I errors)
2. Not matching things that we should have matched (The)
False negatives (Type II errors)
Errors cont.
In NLP we are always dealing with these kinds of errors.
Reducing the error rate for an application often involves two
antagonistic efforts:
◦ Increasing accuracy or precision (minimizing false positives)
◦ Increasing coverage or recall (minimizing false negatives).
Substitutions
Substitution in Python and UNIX commands:
s/regexp1/pattern/
e.g.:
s/colour/color/
Capture Groups
• Say we want to put angles around all numbers:
the 35 boxes → the <35> boxes
• Use parentheses () to "capture" a pattern into a numbered register (1,
2, 3…)
• Use \1 to refer to the contents of the register
s/([0-9]+)/<\1>/
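In Python the same substitution can be written with re.sub; a minimal sketch of the rule above:

import re

# \1 in the replacement refers to whatever the first (...) group captured.
print(re.sub(r"([0-9]+)", r"<\1>", "the 35 boxes"))   # -> the <35> boxes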
Capture groups: multiple registers
/the (.*)er they (.*), the \1er we \2/
Matches
the faster they ran, the faster we ran
But not
the faster they ran, the faster we ate
But suppose we don't want to capture?
Parentheses have a double function: grouping terms, and capturing
Non-capturing groups: add a ?: after the first parenthesis:
/(?:some|a few) (people|cats) like some \1/
matches
◦ some cats like some cats
but not
◦ some cats like some some
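Both behaviours can be checked directly in Python; the patterns are the ones from the slides, wrapped in re.search calls for illustration:

import re

# \1 and \2 must repeat exactly what the first and second groups captured.
pattern = r"the (.*)er they (.*), the \1er we \2"
print(bool(re.search(pattern, "the faster they ran, the faster we ran")))  # True
print(bool(re.search(pattern, "the faster they ran, the faster we ate")))  # False

# (?:...) groups without capturing, so \1 here refers to (people|cats).
pattern = r"(?:some|a few) (people|cats) like some \1"
print(bool(re.search(pattern, "some cats like some cats")))   # True
print(bool(re.search(pattern, "some cats like some some")))   # False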
Simple Application: ELIZA
Early NLP system that imitated a Rogerian psychotherapist
◦ Joseph Weizenbaum, 1966.
Uses pattern matching to match, e.g.,:
◦ “I need X”
and translates them into, e.g.
◦ “What would it mean to you if you got X?”
Simple Application: ELIZA
Men are all alike.
IN WHAT WAY
They're always bugging us about something or other.
CAN YOU THINK OF A SPECIFIC EXAMPLE
Well, my boyfriend made me come here.
YOUR BOYFRIEND MADE YOU COME HERE
He says I'm depressed much of the time.
I AM SORRY TO HEAR YOU ARE DEPRESSED
How ELIZA works
s/.* I’M (depressed|sad) .*/I AM SORRY TO HEAR YOU ARE \1/
s/.* I AM (depressed|sad) .*/WHY DO YOU THINK YOU ARE \1/
s/.* all .*/IN WHAT WAY?/
s/.* always .*/CAN YOU THINK OF A SPECIFIC EXAMPLE?/
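A minimal Python sketch of the same idea; the rule set and fallback below are illustrative, not Weizenbaum's original script:

import re

# Ordered (pattern, response) rules; the first matching rule wins.
RULES = [
    (r".* i'm (depressed|sad) .*", r"I AM SORRY TO HEAR YOU ARE \1"),
    (r".* i am (depressed|sad) .*", r"WHY DO YOU THINK YOU ARE \1"),
    (r".* all .*", "IN WHAT WAY?"),
    (r".* always .*", "CAN YOU THINK OF A SPECIFIC EXAMPLE?"),
]

def eliza_respond(utterance):
    for pattern, response in RULES:
        if re.match(pattern, utterance, flags=re.IGNORECASE):
            return re.sub(pattern, response, utterance, flags=re.IGNORECASE)
    return "PLEASE GO ON"   # fallback when no rule matches

print(eliza_respond("Men are all alike."))                       # IN WHAT WAY?
print(eliza_respond("He says I'm depressed much of the time."))  # I AM SORRY TO HEAR YOU ARE depressed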
How many words in a sentence?
"I do uh main- mainly business data processing"
◦ Fragments, filled pauses
"Seuss’s cat in the hat is different from other cats!"
◦ Lemma: same stem, part of speech, rough word sense
◦ cat and cats = same lemma
◦ Wordform: the full inflected surface form
◦ cat and cats = different wordforms
How many words in a sentence?
they lay back on the San Francisco grass and looked at the stars and
their
Type: an element of the vocabulary.
Token : an instance of that type in running text.
How many?
◦ 15 tokens (or 14)
◦ 13 types (or 12) (or 11)
Corpora
Words don't appear out of nowhere!
A text is produced by
• a specific writer(s),
• at a specific time,
• in a specific variety,
• of a specific language,
• for a specific function.
Text Normalization
Every NLP task requires text normalization:
1. Tokenizing (segmenting) words
2. Normalizing word formats
3. Segmenting sentences
Space-based tokenization
A very simple way to tokenize
◦ For languages that use space characters between words
◦ e.g., writing systems based on Arabic, Cyrillic, Greek or Latin scripts
◦ Segment off a token between instances of spaces
Unix tools for space-based tokenization
◦ The "tr" command
◦ Given a text file, output the word tokens and their frequencies
◦ Remove all the numbers and punctuation.
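A rough Python equivalent of that Unix pipeline (tr to split on non-letters, sort | uniq -c to count); the file name is a placeholder:

import re
from collections import Counter

# Read a text file, split it into lower-cased alphabetic tokens, and print
# the most frequent word types with their counts.
with open("sample.txt") as f:        # placeholder file name
    text = f.read()

tokens = re.findall(r"[a-z]+", text.lower())
for word, count in Counter(tokens).most_common(10):
    print(count, word)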
Issues in Tokenization
Can't just blindly remove punctuation:
◦ m.p.h., Ph.D., AT&T, cap’n
◦ prices ($45.55)
◦ dates (01/02/06)
◦ URLs (http://www.stanford.edu)
◦ hashtags (#nlproc)
◦ email addresses
Clitic contraction: a word that doesn't stand on its own
◦ "are" in we're, French "je" in j'ai, "le" in l'honneur
When should multiword expressions (MWE) be words?
◦ New York, rock ’n’ roll
Tokenization in NLTK
Tokenization needs to be run before any other language processing. A
standard method is to use deterministic algorithms based on regular
expressions.
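For example, NLTK's regexp_tokenize applies exactly such a deterministic regular expression; the pattern below is illustrative, not the one used in the lecture:

import nltk

text = "That U.S.A. poster-print costs $12.40..."

# Illustrative tokenization pattern: abbreviations, prices, hyphenated words,
# ellipses, then single punctuation marks (order matters).
pattern = r"""(?x)            # verbose regular expression
      (?:[A-Z]\.)+            # abbreviations, e.g. U.S.A.
    | \$?\d+(?:\.\d+)?%?      # currency and percentages, e.g. $12.40, 82%
    | \w+(?:-\w+)*            # words, with optional internal hyphens
    | \.\.\.                  # ellipsis
    | [.,;"'?():_`-]          # single punctuation characters
"""

print(nltk.regexp_tokenize(text, pattern))
# ['That', 'U.S.A.', 'poster-print', 'costs', '$12.40', '...']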
Word Normalization
Putting words/tokens in a standard format
◦ U.S.A. or USA
◦ uhhuh or uh-huh
◦ Fed or fed
◦ am, is, be, are
Applications like information retrieval or speech recognition: reduce all
letters to lower case
◦ Since users tend to use lower case
◦ Possible exception: upper case in mid-sentence?
◦ e.g., General Motors
For sentiment analysis or machine translation
◦ Case is helpful (US versus us is important)
Lemmatization
Represent all words as their lemma, their shared root
= dictionary headword form:
◦ am, are, is → be
◦ car, cars, car's, cars' → car
◦ He is reading detective stories
→ He be read detective story
Lemmatization is done by Morphological Parsing
Morphemes:
◦ The small meaningful units that make up words
◦ Stems: The core meaning-bearing units
◦ Affixes: Parts that adhere to stems, often with grammatical functions
Morphological Parsers:
◦ Parse cats into two morphemes cat and s
Porter Stemmer:
◦ Based on a series of rewrite rules run in series. Some sample rules: ATIONAL → ATE (relational → relate), ING → ε if the stem contains a vowel (motoring → motor), SSES → SS (grasses → grass).
Stemming
Reduce terms to stems, chopping off affixes crudely
Original text:
This was not the map we found in Billy Bones’s chest, but an accurate copy,
complete in all things-names and heights and soundings-with the single
exception of the red crosses and the written notes.

Stemmed output:
Thi wa not the map we found in Billi Bone s chest but an accur copi complet
in all thing name and height and sound with the singl except of the red cross
and the written note .
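A minimal sketch using NLTK's implementation of the Porter stemmer; its output closely matches the stemmed text above, modulo punctuation handling:

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
sentence = "This was not the map we found in Billy Bones's chest"

# The stemmer works word by word, so split on whitespace first.
print(" ".join(stemmer.stem(word) for word in sentence.split()))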
Sentence Segmentation
!, ? mostly unambiguous but period “.” is very ambiguous
◦ Sentence boundary
◦ Abbreviations like Inc. or Dr.
◦ Numbers like .02% or 4.3
Common algorithm: Tokenize first: use rules or ML to classify a period as either
(a) part of the word or (b) a sentence-boundary.
◦ An abbreviation dictionary can help
Sentence segmentation can then often be done by rules based on this
tokenization.
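A crude rule-based sketch of that idea in Python; the abbreviation dictionary and splitting rule are illustrative only:

import re

ABBREVIATIONS = {"Dr.", "Mr.", "Mrs.", "Inc.", "etc."}   # tiny illustrative abbreviation dictionary

def split_sentences(text):
    """Split after . ! or ? followed by whitespace and a capital letter,
    unless the token ending in '.' is a known abbreviation."""
    sentences, start = [], 0
    for match in re.finditer(r"[.!?](?=\s+[A-Z])", text):
        token = text[start:match.end()].split()[-1]   # word containing the candidate boundary
        if token not in ABBREVIATIONS:
            sentences.append(text[start:match.end()].strip())
            start = match.end()
    sentences.append(text[start:].strip())
    return sentences

print(split_sentences("Dr. Smith paid $4.30 for petrol. He then drove home."))
# ['Dr. Smith paid $4.30 for petrol.', 'He then drove home.']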
Lecture Overview
• Introduction
• Formal languages and grammars
• Regular expressions
• Minimum edit distance
• Natural language modelling: N-gram models
How similar are two strings?
Spell correction
◦ The user typed “graffe”
Which is closest?
◦ graf
◦ graft
◦ grail
◦ giraffe
• Also for Machine Translation, Information Extraction, Speech Recognition
Edit Distance
The minimum edit distance between two strings is the minimum
number of editing operations:
◦ Insertion
◦ Deletion
◦ Substitution
Needed to transform one into the other
Minimum Edit Distance
Two strings and their alignment (here intention aligned with execution; * marks a gap, and d/s/i mark delete, substitute and insert operations):

I N T E * N T I O N
| | | | | | | | | |
* E X E C U T I O N
d s s   i s
Minimum Edit Distance
If each operation has cost of 1
◦ Distance between these is 5
If substitutions cost 2 (Levenshtein)
◦ Distance between them is 8
How to find the Min Edit Distance?
Searching for a path (sequence of edits) from the start string to the
final string:
◦ Initial state: the word we’re transforming
◦ Operators: insert, delete, substitute
◦ Goal state: the word we’re trying to get to
◦ Path cost: what we want to minimize: the number of edits
Minimum Edit as Search
But the space of all edit sequences is huge!
◦ We can’t afford to navigate naïvely
◦ Lots of distinct paths wind up at the same state.
◦ We don’t have to keep track of all of them
◦ Just the shortest path to each of those revisited states.
Defining Min Edit Distance
For two strings
◦ X of length n
◦ Y of length m
We define D(i,j)
◦ the edit distance between X[1..i] and Y[1..j]
◦ i.e., the first i characters of X and the first j characters of Y
◦ The edit distance between X and Y is thus D(n,m)
Dynamic Programming for Minimum Edit Distance
Dynamic programming: A tabular computation of D(n,m)
Solving problems by combining solutions to subproblems.
Bottom-up
◦ We compute D(i,j) for small i,j
◦ And compute larger D(i,j) based on previously computed smaller values
◦ i.e., compute D(i,j) for all i (0 ≤ i ≤ n) and j (0 ≤ j ≤ m)
Defining Min Edit Distance (Levenshtein)
Initialization:
  D(i,0) = i
  D(0,j) = j
Recurrence Relation:
  For each i = 1…n
    For each j = 1…m
      D(i,j) = min( D(i-1,j) + 1,                                        (deletion)
                    D(i,j-1) + 1,                                        (insertion)
                    D(i-1,j-1) + 2 if X(i) ≠ Y(j), + 0 if X(i) = Y(j) )  (substitution)
Termination:
  D(n,m) is the distance
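A direct Python implementation of this recurrence (substitution cost 2, as in the Levenshtein variant above); a minimal sketch rather than an optimised one:

def min_edit_distance(source, target):
    n, m = len(source), len(target)
    # D[i][j] = edit distance between source[:i] and target[:j]
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0] = i                       # deletions only
    for j in range(1, m + 1):
        D[0][j] = j                       # insertions only
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub_cost = 0 if source[i - 1] == target[j - 1] else 2
            D[i][j] = min(D[i - 1][j] + 1,               # deletion
                          D[i][j - 1] + 1,               # insertion
                          D[i - 1][j - 1] + sub_cost)    # substitution (or copy)
    return D[n][m]

print(min_edit_distance("intention", "execution"))   # 8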
The Edit Distance Table
(Initialization: intention down the left, execution along the bottom.)
N  9
O  8
I  7
T  6
N  5
E  4
T  3
N  2
I  1
#  0  1  2  3  4  5  6  7  8  9
   #  E  X  E  C  U  T  I  O  N
The Edit Distance Table
N  9  8  9 10 11 12 11 10  9  8
O  8  7  8  9 10 11 10  9  8  9
I  7  6  7  8  9 10  9  8  9 10
T  6  5  6  7  8  9  8  9 10 11
N  5  4  5  6  7  8  9 10 11 10
E  4  3  4  5  6  7  8  9 10  9
T  3  4  5  6  7  8  7  8  9  8
N  2  3  4  5  6  7  8  7  8  7
I  1  2  3  4  5  6  7  6  7  8
#  0  1  2  3  4  5  6  7  8  9
   #  E  X  E  C  U  T  I  O  N
(The minimum edit distance D(9,9) = 8 is in the top-right cell.)
Lecture Overview
• Introduction
• Formal languages and grammars
• Regular expressions
• Minimum edit distance
• Natural language modelling: N-gram models
Probabilistic Language Models
Goal: assign a probability to a sentence
◦ Machine Translation:
◦ P(high winds tonite) > P(large winds tonite)
◦ Spell Correction
◦ The office is about fifteen minuets from my house
Why?
◦ P(about fifteen minutes from) > P(about fifteen minuets from)
◦ Speech Recognition
◦ P(I saw a van) >> P(eyes awe of an)
Probabilistic Language Modeling
Goal: compute the probability of a sentence or sequence of words:
P(W) = P(w1,w2,w3,w4,w5…wn)
Related task: probability of an upcoming word:
P(w5|w1,w2,w3,w4)
A model that computes either of these:
P(W) or P(wn|w1,w2…wn-1) is called a language model.
A better name might be "the grammar", but "language model" or LM is standard.
How to compute P(W)
How to compute this joint probability:
◦ P(its, water, is, so, transparent, that)
Using conditional probabilities:
P(B|A) = P(A,B)/P(A). Rewriting: P(A,B) = P(A)P(B|A)
More variables:
P(A,B,C,D) = P(A)P(B|A)P(C|A,B)P(D|A,B,C)
The Chain Rule in General
P(x1,x2,x3,…,xn) = P(x1)P(x2|x1)P(x3|x1,x2)…P(xn|x1,…,xn-1)
The Chain Rule applied to compute joint probability
of words in sentence
P(“its water is so transparent”) =
P(its) × P(water|its) × P(is|its water)
× P(so|its water is) × P(transparent|its water is so)
How to estimate these probabilities
Could we just count and divide?
No! Too many possible sentences!
We’ll never see enough data for estimating these
Markov Assumption
Simplifying assumption (Andrei Markov): condition only on the previous word,
P(the | its water is so transparent that) ≈ P(the | that)
Or maybe on the previous two words:
P(the | its water is so transparent that) ≈ P(the | transparent that)
Markov Assumption
In other words, we approximate each component in the product:
P(wi | w1 w2 … wi-1) ≈ P(wi | wi-k … wi-1)
Simplest case: Unigram model
P(w1 w2 … wn) ≈ P(w1) P(w2) … P(wn)
Some automatically generated sentences from a unigram model
fifth, an, of, futures, the, an, incorporated, a, a,
the, inflation, most, dollars, quarter, in, is, mass
thrift, did, eighty, said, hard, 'm, july, bullish
that, or, limited, the
Bigram model
Condition on the previous word:
P(wi | w1 w2 … wi-1) ≈ P(wi | wi-1)
texaco, rose, one, in, this, issue, is, pursuing, growth, in,
a, boiler, house, said, mr., gurria, mexico, 's, motion,
control, proposal, without, permission, from, five, hundred,
fifty, five, yen
outside, new, car, parking, lot, of, the, agreement, reached
this, would, be, a, record, november
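A minimal sketch of a bigram model in Python, trained on a tiny toy corpus (the generated text above comes from much larger news data): counts are turned into maximum-likelihood estimates of P(word | previous word), which can also be sampled to generate sentences.

import random
from collections import Counter, defaultdict

# Toy corpus with sentence-boundary markers <s> and </s>.
corpus = [
    "<s> I am Sam </s>",
    "<s> Sam I am </s>",
    "<s> I do not like green eggs and ham </s>",
]

# Count bigrams: bigram_counts[prev][word] = C(prev, word)
bigram_counts = defaultdict(Counter)
for sentence in corpus:
    tokens = sentence.split()
    for prev, word in zip(tokens, tokens[1:]):
        bigram_counts[prev][word] += 1

def bigram_prob(prev, word):
    """Maximum-likelihood estimate P(word | prev) = C(prev, word) / C(prev)."""
    total = sum(bigram_counts[prev].values())
    return bigram_counts[prev][word] / total if total else 0.0

print(bigram_prob("<s>", "I"))   # 2/3
print(bigram_prob("I", "am"))    # 2/3

# Generate a sentence by sampling each word given the previous one.
word, generated = "<s>", []
while True:
    word = random.choice(list(bigram_counts[word].elements()))
    if word == "</s>":
        break
    generated.append(word)
print(" ".join(generated))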
N-gram models
We can extend to trigrams, 4-grams, 5-grams
In general this is an insufficient model of language
◦ because language has long-distance dependencies:
“The computer which I had just put into the machine room on the fifth
floor crashed.”
But we can often get away with N-gram models
Approximating Shakespeare
Key words
References
• Jurafsky, D. & Martin, J. H. Speech and Language Processing. Stanford, 2023. Chapters 2 and 3.
• Russell, S. J. & Norvig, P. Artificial Intelligence: A Modern Approach. Fourth Edition, Pearson Education, Hoboken, NJ, 2021. Chapters 22 and 23.
Feedback
• In case you want to provide anonymous
feedback on these lectures, please visit:
• https://forms.gle/KBkN744QuffuAZLF8
Thank you very much!