Chapter 23 - Natural Language Processing
Winter 2025
The history of language
• Language: A primary trait distinguishing
humans from animals
• Aristotle: Humans as “rational animals” with
speech
• Early signs of abstract thinking:
• Cave paintings as symbolic expression
• Evidence of communication beyond
immediate reality
• Language enabled:
• Imagination
• Storytelling
• Cultural memory
M. Makrehchi 2
The timeline of language
• Spoken language: Emerged around 100,000
years ago
• Written language: Appeared only about 5,000
years ago
• If human history were compressed into 24
hours:
• Writing began just 15 minutes ago
• Implication:
• History began with writing
• Vast majority of human experience is prehistoric
Formal vs. Natural Language
• Natural Language:
• Evolved organically among humans
• Flexible, ambiguous, context-dependent
• Formal Language:
• Designed for precision (e.g., math, programming)
• Rigid syntax, unambiguous semantics
• Key Differences:
• Ambiguity: Natural (yes), Formal (no)
• Redundancy: Natural (high), Formal (low)
• Context-dependence: Natural (yes), Formal (no)
• Tolerance for errors: Natural (yes), Formal (no)
Examples of formal and natural languages
• Natural Languages:
• English, Arabic, Mandarin, Farsi, French,
Swahili...
• Evolved through human interaction
• Formal Languages:
• Mathematical notation: ∑, ∫, ∀x ∈ X
• Programming languages: Python, Java, C++
• Logic: Propositional logic (e.g., ¬P ∨ Q), First-order logic
What is Natural Language Processing (NLP)?
NLP vs. CL
Why is NLP Hard?
• Ambiguity:
• Words and sentences often have multiple meanings
• Context-dependence:
• Meaning changes with context, tone, or world knowledge
• Variability and Creativity:
• Many ways to say the same thing
• Imprecision and Errors:
• Typos, slang, grammatical errors
• Hidden Structure:
• Syntax and semantics are not always explicit
• World Knowledge Required:
• Understanding language often needs common sense or real-world
knowledge
Ambiguity
Language is full of ambiguity at every level:
• Lexical:
"bank" (riverbank vs. financial institution)
• Syntactic:
"I saw the man with the telescope."
→ Who had the telescope?
• Semantic:
"Time flies like an arrow."
→ Is “flies” a verb or a noun?
Disambiguating meaning requires both context and
reasoning.
Context-Dependence
Words often change meaning based on:
• Situation
• Cultural norms
• Prior sentences
Example:
“He was late. She was furious.” → Furious
about what?
Understanding requires inference beyond the
text.
Variability and Creativity
Humans are endlessly creative in how they
express ideas:
• Idioms: “Kick the bucket”
• Sarcasm: “Oh great, another Monday!”
• Code-switching, dialects, and informal speech
all introduce variability that machines struggle
to interpret.
Errors and Informality
Real-world language includes:
• Spelling mistakes
• Incomplete sentences
• Slang and emojis
Traditional models expect clean input—real text
is often messy.
Hidden Structure
Grammar and meaning are often implicit, not
explicitly marked:
• Unlike programming languages, natural
language lacks fixed structure
• Parsing human sentences correctly is far from
trivial
Requires Real-World and Common-Sense
Knowledge
Understanding language often requires knowing things
about the world:
• “He put the ice cream in the oven.” → Seems strange,
but why?
• This kind of inference depends on commonsense
reasoning, something machines still struggle with.
Text Classification
• Definition: Automatically assigning predefined
categories or labels to a given text.
• Applications:
• Sentiment analysis: "I love this movie!" →
Positive
• Spam detection: "Congratulations, you've
won a prize!" → Spam
• Topic classification: "Global warming is a
critical issue." → Environment
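A text classifier can be as simple as a keyword lookup. The sketch below is a toy rule-based sentiment classifier; the cue-word lists are illustrative assumptions (real systems learn such signals from labeled data):

```python
# Toy rule-based sentiment classifier -- a minimal sketch, not a trained model.
# The keyword sets below are illustrative assumptions, not from a real lexicon.
POSITIVE = {"love", "great", "excellent", "wonderful"}
NEGATIVE = {"hate", "terrible", "awful", "boring"}

def classify_sentiment(text: str) -> str:
    """Label text Positive/Negative/Neutral by counting cue words."""
    words = {w.strip(".,!?").lower() for w in text.split()}
    pos, neg = len(words & POSITIVE), len(words & NEGATIVE)
    if pos > neg:
        return "Positive"
    if neg > pos:
        return "Negative"
    return "Neutral"

print(classify_sentiment("I love this movie!"))  # Positive
```

Learned classifiers (e.g., logistic regression over word counts) follow the same interface: text in, label out.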
Named Entity Recognition (NER)
• Definition: Identifying and classifying named entities in
text into predefined categories.
• Examples:
• "Barack Obama was born in Hawaii."
• [Barack Obama] → PERSON
• [Hawaii] → LOCATION
• "Apple announced the iPhone 15 in September."
• [Apple] → ORGANIZATION
• [iPhone 15] → PRODUCT
• [September] → DATE
Part-of-Speech (POS) Tagging
• Definition: Labeling each word in a sentence
with its corresponding part of speech.
• Example:
• Sentence: "The quick brown fox jumps over
the lazy dog."
• Tags: The/DET quick/ADJ brown/ADJ
fox/NOUN jumps/VERB over/ADP the/DET
lazy/ADJ dog/NOUN
Parsing
• Definition: Analyzing the grammatical structure of a
sentence.
• Types:
• Constituency Parsing: Breaks down the sentence into sub-phrases.
• Dependency Parsing: Shows grammatical relationships
between words.
• Example (Dependency):
• "She enjoys reading books."
• Subject: She → enjoys
• Object: books → reading → enjoys
Parsing
• Short-Term Dependency:
• Nearby words affect each other directly
• Example:
“The dog barked loudly.” → “dog” helps predict “barked”
• Long-Term Dependency:
• A word depends on another word far back in the
sentence or text
• Example:
“The book that the professor who the students admired
wrote was fascinating.”
→ “book” connects to “was fascinating”
Machine Translation
• Definition: Automatically translating text from one
language to another.
• Example:
• Input: "Hello, how are you?"
• Output (French): "Bonjour, comment ça va ?"
• Applications:
• Travel and tourism
• Global business communication
Question Answering (QA)
• Definition: Given a question, the system
provides a concise and correct answer.
• Types:
• Closed-domain: Specific topics (e.g., medical,
legal)
• Open-domain: General knowledge
• Example:
• Question: "What is the capital of Japan?"
• Answer: "Tokyo"
Text Summarization
• Definition: Producing a shorter version of a longer text
while retaining its main ideas.
• Types:
• Extractive: Select key sentences
• Abstractive: Generate summary using paraphrased
text
• Example:
• Input: "Artificial intelligence is transforming
industries across the globe..."
• Output: "AI is revolutionizing global industries."
Text Generation
• Definition: Automatically generating
coherent and contextually relevant text.
• Examples:
• Input prompt: "Write a story about a
robot that learns to love."
• Output: "Once upon a time, in a distant
future, a lonely robot named Aria..."
Coreference Resolution
• Definition: Determining when different expressions
refer to the same entity.
• Example:
• "Ava dropped her phone. She picked it up."
• "She" → Ava
• "it" → phone
• Importance:
• Essential for understanding continuity in text and
dialogue.
Word Representation in NLP
• Words must be converted to numbers for machines to
process
• Common types of word representations:
• One-hot (encoding) vectors: Binary, sparse, no semantic
similarity
• Word embeddings: Dense vectors capturing meaning
• Word2Vec, GloVe, FastText
• Contextual embeddings: Word meaning varies with context
• ELMo, BERT, GPT
One-Hot Encoding
• Each word is represented by a vector with all
zeros except a one in the position assigned to
that word.
• Example (vocabulary of 5 words):
• “cat” → [0, 1, 0, 0, 0]
• Limitations:
• High dimensional and sparse
• Doesn’t capture meaning or similarity
between words
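The encoding above can be sketched in a few lines, using the slide's five-word toy vocabulary:

```python
# One-hot encoding over a fixed 5-word vocabulary (sketch).
VOCAB = ["the", "cat", "sat", "on", "mat"]

def one_hot(word: str) -> list[int]:
    """Return a vector of zeros with a single 1 at the word's index."""
    vec = [0] * len(VOCAB)
    vec[VOCAB.index(word)] = 1  # raises ValueError for out-of-vocabulary words
    return vec

print(one_hot("cat"))  # [0, 1, 0, 0, 0]
```

With a real vocabulary of 100k+ words, each vector has 100k+ entries and exactly one 1, which is why this representation is called high-dimensional and sparse.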
Word Embeddings
• Words are mapped to dense, low-dimensional vectors
based on their usage in language.
• Words with similar meanings have similar vectors.
• Popular models:
• Word2Vec: Predicts a word from its context (or vice versa)
• GloVe: Combines local and global word co-occurrence
• FastText: Includes subword information for better handling of
rare words
• Example:
• vector("king") – vector("man") + vector("woman") ≈
vector("queen")
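The king − man + woman analogy can be reproduced with hand-crafted toy vectors. The two dimensions ("royalty" and "gender") are an illustrative assumption; real embeddings are learned and have hundreds of dimensions:

```python
import math

# Hand-crafted 2-d toy vectors: dim 0 = "royalty", dim 1 = "gender".
# Illustrative assumption only; real embeddings are learned from corpora.
E = {"king": [1.0, 1.0], "queen": [1.0, -1.0],
     "man":  [0.0, 1.0], "woman": [0.0, -1.0]}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def analogy(a, b, c):
    """Return the word nearest to vector(a) - vector(b) + vector(c)."""
    target = [x - y + z for x, y, z in zip(E[a], E[b], E[c])]
    candidates = {w: v for w, v in E.items() if w not in (a, b, c)}
    return max(candidates, key=lambda w: cosine(target, candidates[w]))

print(analogy("king", "man", "woman"))  # queen
```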
Contextual Embeddings
• Traditional embeddings assign one vector per word,
regardless of context.
• Contextual embeddings adapt based on the sentence.
• Example:
• “bank” in:
• “He sat by the river bank”
• “She went to the bank to deposit money”
• Models like ELMo, BERT, and GPT generate different
vectors for each use based on context.
Why word-representation matters?
• Word representations are the input layer for
most NLP models. Better representations lead
to:
• Improved semantic understanding
• More accurate predictions
• Robust performance across tasks
• As models have evolved, contextualized
representations have become the norm, especially
in systems based on transformers and foundation
models.
Text Data
• Usually natural language text (human generated)
• Unstructured or semi-structured
• Features or variables are tokens of text (characters,
words, n-grams, sentences, part of speech, named
entities, semantics, and so on) → terms
• Data set: Corpora (corpus)
• Represented by document-term matrix (n x m)
• A very big and sparse matrix
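A minimal sketch of building such a document-term matrix from a toy corpus (binary features; real corpora yield matrices with millions of mostly zero entries, hence sparse storage):

```python
# Building a small binary document-term matrix (sketch).
docs = ["the cat sat", "the dog barked", "the cat and the dog"]

# Vocabulary = sorted set of all tokens (the matrix columns / terms).
vocab = sorted({w for d in docs for w in d.split()})

# One row per document: 1 if the term occurs in it, else 0.
matrix = [[1 if term in d.split() else 0 for term in vocab] for d in docs]

print(vocab)
for row in matrix:
    print(row)
```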
Text (Document) Data Representation
• Lexical:
• Character
• Words
• Phrases
• Part-of-speech tags
• Syntactic:
• Vector-space model
• Language models
• Full vs. skeleton parsing
• Semantic:
• Collaborative tagging / Web2.0
• Templates / Frames
• Ontologies / First-order theories
• Taxonomies / thesauri
• Cross-modality: in multi-media data mining
Character Level
• Sequences of characters are extracted
• A document is represented by a frequency distribution
of sequences
• Each character sequence of length 1, 2, 3, … represents a
feature with its frequency
• Good and bad sides
• It is very robust since it avoids language morphology: useful
for language identification
• It captures simple patterns on character level: useful for spam
and plagiarism detection
• For deeper semantic tasks, the representation is too weak
Word level
• The most common representations: bag of words, word embeddings, etc.
• Important to know:
• "Word" is a well-defined unit in Western languages; Chinese, for example, has a different notion of the basic semantic unit
• Pre-processing steps:
• Tokenization
• Converting to lower case?
• Stemming?
• Stop-word removal?
• Band-pass filtering?
• Document length normalization
Relations between Words (1)
• Local relations:
• Co-occurrence and collocation: two words often appear together in documents (or in a specific category of documents)
• Counted over the whole document, or within a sequence / sliding window
• If we see "united" in a document, what is the probability that the next word is "nations", and what is the probability of "states"?
• We can learn these probabilities from a large collection of example documents.
WordNet example
• 26 relations, 116k senses
• (Figure: a fragment of the WordNet semantic network centred on "chicken", linking concepts such as hen, duck, goose, bird, animal, creature, egg, meat, feather, wing, beak, claw and leg through relations like Is_a, Part, Purpose, Typ_obj, Typ_subj, Caused_by and Means)
Word Frequency
• Word frequencies in texts have power law
distribution:
• Small number of very frequent words
• Big number of low frequency words
Zipf’s Law
• Even in a very large corpus, there will be a lot
of infrequent words
• The same holds for many other levels of
linguistic structure
• Core NLP challenge: we need to estimate
probabilities or to be able to make predictions
for things we have rarely or never seen
Stop-words
• Stop-words are words that, from a non-linguistic point of view, carry little information
• They play mainly functional (grammatical) roles
• We usually remove them to help the methods perform better
• Stop words are language dependent – examples:
• English: A, ABOUT, ABOVE, ACROSS, AFTER, AGAIN, AGAINST,
ALL, ALMOST, ALONE, ALONG, ALREADY
• Dutch: de, en, van, ik, te, dat, die, in, een, hij, het, niet, zijn, is,
was, op, aan, met, als, voor, had, er, maar, om, hem, dan, zou,
of, wat, mijn, men, dit, zo, ...
• Slovenian: A, AH, AHA, ALI, AMPAK, BAJE, BODISI, BOJDA,
BRŽKONE, BRŽČAS, BREZ, CELO, DA, DO, ...
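Stop-word removal is a simple set-membership filter; the list below is a tiny illustrative subset of the English stop-words above, not a complete stop-list:

```python
# Stop-word removal sketch; the set is a tiny illustrative subset.
STOP_WORDS = {"a", "about", "above", "the", "is", "of", "to", "in"}

def remove_stop_words(tokens):
    """Keep only tokens that are not stop-words (case-insensitive)."""
    return [t for t in tokens if t.lower() not in STOP_WORDS]

print(remove_stop_words(["The", "cat", "is", "on", "the", "mat"]))
# ['cat', 'on', 'mat']
```

Because stop-words are language dependent, a real pipeline would select the stop-list to match the document language.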
Stemming (1)
• Different forms of the same word are usually problematic for text data analysis, because they have different spellings but similar meanings (e.g. learns, learned, learning, …)
• Stemming is the process of transforming a word into its stem (normalized form)
• Stemming provides an inexpensive mechanism to merge different forms of the same word
• Online stemming
Stemming (2)
• For English we mostly use Porter stemmer
http://www.tartarus.org/~martin/PorterStemmer/
• Example cascade rules used in English Porter stemmer
• ATIONAL -> ATE relational -> relate
• TIONAL -> TION conditional -> condition
• ENCI -> ENCE valenci -> valence
• ANCI -> ANCE hesitanci -> hesitance
• IZER -> IZE digitizer -> digitize
• ABLI -> ABLE conformabli -> conformable
• ALLI -> AL radicalli -> radical
• ENTLI -> ENT differentli -> different
• ELI -> E vileli -> vile
• OUSLI -> OUS analogousli -> analogous
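The rule cascade above can be sketched as a first-match suffix rewrite. This toy version omits the Porter algorithm's stem-measure checks, so it only illustrates the rule idea, not the full stemmer:

```python
# A few Porter-stemmer cascade rules from the slide, applied as a simple
# first-match suffix rewrite. Rule order matters: "ational" must be tried
# before "tional". This is a sketch, not the full Porter algorithm.
RULES = [("ational", "ate"), ("tional", "tion"), ("enci", "ence"),
         ("anci", "ance"), ("izer", "ize"), ("abli", "able"),
         ("alli", "al"), ("entli", "ent"), ("eli", "e"), ("ousli", "ous")]

def apply_rules(word: str) -> str:
    """Rewrite the first matching suffix; return the word unchanged otherwise."""
    for suffix, repl in RULES:
        if word.endswith(suffix):
            return word[: -len(suffix)] + repl
    return word

print(apply_rules("relational"))   # relate
print(apply_rules("conditional"))  # condition
```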
Lemmatization
Stemming vs. lemmatization
Phrase Level
• Instead of having just single words we can deal
with phrases
• We use two types of phrases:
• Phrases as frequent contiguous word sequences
• Phrases as frequent non-contiguous word
sequences
• Both types of phrases can be identified by a simple dynamic-programming algorithm
• The main effect of using phrases is to more
precisely identify senses
Part-of-Speech level
• Introduces word-types to differentiate words functions
• For text-analysis part-of-speech information is used mainly for
“information extraction” where we are interested in e.g.
named entities which are “noun phrases”
• Another possible use is for feature reduction
• it is known that nouns carry most of the information in text
documents
TF-IDF weighting
• tfidf(w) = tf(w) · log( N / df(w) )
• tf(w) – term frequency (number of occurrences of the word in a document)
• df(w) – document frequency (number of documents containing the word)
• N – number of all documents
• tfidf(w) – relative importance of the word in the document
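The formula above in code, on a toy three-document corpus (a sketch; practical systems add smoothing and length normalization):

```python
import math

# TF-IDF sketch following the slide's formula: tfidf(w) = tf(w) * log(N / df(w)).
docs = [["the", "cat", "sat"],
        ["the", "dog", "barked"],
        ["the", "cat", "and", "the", "dog"]]
N = len(docs)

def tfidf(word: str, doc: list[str]) -> float:
    tf = doc.count(word)                    # occurrences in this document
    df = sum(1 for d in docs if word in d)  # documents containing the word
    return tf * math.log(N / df)

print(tfidf("the", docs[0]))  # 0.0 -- "the" appears in every document
print(round(tfidf("cat", docs[0]), 3))
```

Note how a word that occurs in every document gets weight 0: the idf factor suppresses uninformative terms like stop-words automatically.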
Text Data Example (4)
        T1   T2   T3   T4   ……   Tn
D1       1    0    0    0         1
D2       0    1    0    1         0
D3       0    0    0    1         0
D4       1    0    1    0         1
…
Dm       0    1    1    0         1
(rows: documents, columns: terms; binary features)
Text Data Example (7)
        T1   T2   T3   T4   ……   Tn
D1      0.5  0.1  0.1  0.0        0.7
D2      0.4  1.0  0.2  0.9        0.2
D3      0.1  0.1  0.0  1.0        0.3
D4      0.8  0.0  0.7  0.0        0.6
…
Dm      0.2  1.0  0.9  0.0        0.8
(rows: documents, columns: terms; weighted features, e.g. TF-IDF)
Main Objective in NLP — Language Modeling
Language modeling: Statistical Language Models
(SLM)
• Theoretical background: Based on Markov
Assumption
• The Markov assumption: the probability of
an event occurring in the future (in a
sequence) depends only on a finite, fixed
number of past events, rather than the
entire history.
• In other words, it assumes that the future is
conditionally independent of the past,
given a limited "context window" of
previous events.
Language modeling: Statistical Language Models
(SLM)
• The level of this assumption is often referred to as the
"order" of the Markov model:
• First-order Markov Model (Markov Chain): The probability of an event depends only on the immediately preceding event.
• Higher-order Markov Models: The probability of an event depends on a fixed number of preceding events. For example, in a second-order Markov model, the probability of an event depends on the two preceding events, and so on.
Language modeling: Neural Language Models
(NLM)
• Theoretical background (linguistics): Distributional representation (John Rupert Firth, 1957)
• The distributional hypothesis posits that words that
occur in similar contexts tend to have similar
meanings. In other words, words that are used in
similar ways in text are likely to be related in
meaning.
What is a Language Model?
• A language model estimates the probability of a
sequence of words
• Example: P("The cat sat on the mat") > P("Cat the mat on sat
the")
• Purpose: Capture patterns and structure in human language
• Types:
• Probabilistic: N-gram models
• Neural: RNN, LSTM, Transformer-based models (e.g., GPT)
Language modeling
• Goal: Assigning probability to a sequence of words
• For text understanding:
p("The cat is on the mat") > p("Truck the earth on")
Modern NLP: Everything as a Text-to-Text
Task
• Modern NLP uses foundation models (e.g., GPT, T5, PaLM)
• Core idea: Convert any NLP task into text-to-text format
• Examples:
• Sentiment Analysis:
The movie’s closing scene is attractive; it was → "good"
• Machine Translation:
“Hello world” in French is → "Bonjour le monde"
• Question Answering:
Which city is UVA located in? → "Charlottesville"
All are reduced to language modeling problems
Probabilistic Language Modeling —
Autoregressive Assumption
N-grams in Language Modeling
• N-gram = sequence of n words
• Approximates language using the Markov
assumption:
• A word depends only on the last (n−1) words
• Common examples:
• Unigram: P(w₁)
• Bigram: P(w₂ | w₁)
• Trigram: P(w₃ | w₁, w₂)
• Trade-off:
• Higher n = better context, but needs more data
N-grams in Language Modeling
An N-gram is a simple yet powerful model used to estimate the probability
of a word based on the (n–1) previous words. This approach uses the
Markov assumption, which simplifies the complex dependencies in
language by assuming that the current word is only influenced by a limited
history.
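A minimal maximum-likelihood trigram estimator over a toy corpus, illustrating the count-based probability P(w₃ | w₁, w₂) = count(w₁ w₂ w₃) / count(w₁ w₂):

```python
from collections import Counter

# Maximum-likelihood trigram probabilities from raw counts (sketch):
#   P(w3 | w1, w2) = count(w1 w2 w3) / count(w1 w2)
corpus = "the cat sat on the mat the cat ran".split()

trigrams = Counter(zip(corpus, corpus[1:], corpus[2:]))
bigrams = Counter(zip(corpus, corpus[1:]))

def p(w3, w1, w2):
    """MLE probability of w3 following the bigram (w1, w2)."""
    return trigrams[(w1, w2, w3)] / bigrams[(w1, w2)]

print(p("sat", "the", "cat"))  # 0.5 -- "the cat" is followed once by "sat", once by "ran"
```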
Example of Trigram (n=3) Language Model
M. Makrehchi 73
Example of Trigram (n=3) Language Model
M. Makrehchi 74
Example of Trigram (n=3) Language Model
M. Makrehchi 75
Example of Trigram (n=3) Language Model
For trigrams:
M. Makrehchi 76
N-Gram generative models
• Language modeling is about determining the probability of a sequence of words
• The task is typically reduced to estimating the probability of the next word given the two previous words (trigram model):
• P(w₃ | w₁, w₂) = count(w₁ w₂ w₃) / count(w₁ w₂), estimated from frequencies of word sequences in a corpus
N-grams in Language Modeling
• Trade-offs:
• Smaller n (e.g., unigram, bigram):
• Less memory and faster training
• Poor understanding of context
N-grams in Language Modeling
• Limitations:
• Can't capture long-range dependencies
• Fails when the sequence has not been seen in training (zero
probabilities)
• Needs smoothing techniques (e.g., Laplace, Kneser-Ney) to adjust
for unseen sequences
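Laplace (add-one) smoothing, the simplest of the techniques mentioned above, sketched for a bigram model on a toy corpus:

```python
from collections import Counter

# Laplace (add-one) smoothed bigram probability (sketch):
#   P(w2 | w1) = (count(w1 w2) + 1) / (count(w1) + V),  V = vocabulary size
corpus = "the cat sat on the mat".split()
unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))
V = len(unigrams)

def p_laplace(w2, w1):
    """Smoothed probability: unseen bigrams get a small nonzero mass."""
    return (bigrams[(w1, w2)] + 1) / (unigrams[w1] + V)

print(p_laplace("cat", "the"))  # seen bigram: relatively high
print(p_laplace("dog", "the"))  # unseen bigram: small but nonzero
```

Without the +1, p("dog" | "the") would be exactly 0 and any sentence containing it would get probability 0; that is the zero-probability problem smoothing solves.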
N-Gram generative models
• Suppose we have a language model that gives us the estimate of P(w | x₁, x₂, …, xₙ); we can then generate the next tokens one by one:
1. Sampling: xᵢ ~ P(w | x₁, x₂, …, xₙ)
2. Greedy: xᵢ = argmax_w P(w | x₁, x₂, …, xₙ)
Softmax: Probability Calculation
Softmax is a mathematical function used to normalize scores into a valid probability distribution:
Given scores (or logits) → Softmax → Probabilities (sum to 1)
For n-gram models:
• You don't usually need softmax explicitly, because probabilities are computed directly from counts.
• Neural language models use softmax to turn raw scores into probabilities.
Softmax is used for:
• Training (predicting the next word)
• Evaluating how likely a sentence is (probabilistic modeling)
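A minimal softmax implementation, with the standard max-subtraction trick for numerical stability:

```python
import math

# Softmax: turn arbitrary scores (logits) into a probability distribution.
def softmax(scores):
    m = max(scores)  # subtracting the max avoids overflow in exp()
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])
print(probs)       # highest probability for the largest score
print(sum(probs))  # 1.0 (up to floating-point rounding)
```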
Sampling: Generation Step
• Sampling is what you do after you have probabilities to actually
pick the next word.
• You have two options:
• Greedy: Always choose the word with the highest probability.
• Leads to repetitive, sometimes dull text.
• Sampling:
• Treat the probabilities as a distribution.
• Randomly select the next word according to its probability.
• This introduces variation, creativity, and diversity in
generated text.
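Both strategies in a sketch, over an assumed toy next-word distribution:

```python
import random

# Next-word selection from a probability distribution (sketch).
# The distribution below is an illustrative assumption, not model output.
dist = {"mat": 0.6, "sofa": 0.3, "moon": 0.1}

def greedy(dist):
    """Always pick the most probable word (deterministic, can be repetitive)."""
    return max(dist, key=dist.get)

def sample(dist, rng=random):
    """Pick a word at random in proportion to its probability."""
    words, probs = zip(*dist.items())
    return rng.choices(words, weights=probs, k=1)[0]

print(greedy(dist))  # mat, every time
print(sample(dist, random.Random(0)))  # varies with the seed
```

Practical systems interpolate between these extremes, e.g. by sampling from a temperature-scaled or top-k truncated distribution.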
N-Gram generative models
• Recursively sample xᵢ ~ P(w | x₁, x₂, …, xᵢ₋₁) until we generate [EOS]
• Generate the first word: "the"  x₁ ~ p(w | [BOS])
• Generate the second word: "cat"  x₂ ~ p(w | "the")
• Generate the third word: "is"  x₃ ~ p(w | "the cat")
• Generate the fourth word: "on"  x₄ ~ p(w | "the cat is")
• Generate the fifth word: "the"  x₅ ~ p(w | "the cat is on")
• Generate the sixth word: "mat"  x₆ ~ p(w | "the cat is on the")
• Generate the seventh word: "[EOS]"  x₇ ~ p(w | "the cat is on the mat")
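The generation loop above can be sketched with a toy lookup table playing the role of the trigram model; the table is an illustrative assumption, not learned from data:

```python
# Autoregressive generation sketch: repeatedly pick the most likely next
# word given the last two tokens (trigram-style context) until [EOS].
# The lookup table is an illustrative assumption, not a trained model.
NEXT = {("[BOS]",): "the",
        ("[BOS]", "the"): "cat",
        ("the", "cat"): "is",
        ("cat", "is"): "on",
        ("is", "on"): "the",
        ("on", "the"): "mat",
        ("the", "mat"): "[EOS]"}

def generate():
    tokens = ["[BOS]"]
    while True:
        context = tuple(tokens[-2:])  # condition on the last two tokens
        w = NEXT[context]
        if w == "[EOS]":
            return tokens[1:]  # drop the [BOS] marker
        tokens.append(w)

print(" ".join(generate()))  # the cat is on the mat
```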
Google N-Gram
• Some statistics of the corpus:
• File sizes: approx. 24 GB compressed (gzip'ed)
text files
• Number of tokens: 1,024,908,267,229
• Number of sentences: 95,119,665,584
• Number of unigrams: 13,588,391
• Number of bigrams: 314,843,401
• Number of trigrams: 977,069,902
• Number of fourgrams: 1,313,818,354
• Number of fivegrams: 1,176,470,663
History of Language Models — The Four Eras

Era        | Paradigm                       | Key Models           | Characteristics
< 2000     | Probabilistic models           | N-gram, HMM          | Rule-based, sparse, short memory
2000–2018  | Shallow neural models          | Word2Vec, RNN, GloVe | Embeddings, sequential models
2018–2022  | Small pre-trained transformers | BERT, GPT-2, T5      | Pre-training + fine-tuning
2022+      | Large language models          | GPT-3.5, GPT-4, PaLM | Massive scale, general-purpose LLMs