Chapter 23 - Natural Language Processing
Winter 2025
The history of language
• Language: A primary trait distinguishing
humans from animals
• Aristotle: Humans as “rational animals” with
speech
• Early signs of abstract thinking:
• Cave paintings as symbolic expression
• Evidence of communication beyond
immediate reality
• Language enabled:
• Imagination
• Storytelling
• Cultural memory
M. Makrehchi 2
The timeline of language
• Spoken language: Emerged around 100,000
years ago
• Written language: Appeared only about 5,000
years ago
• If human history were compressed into 24
hours:
• Writing began just 15 minutes ago
• Implication:
• History began with writing
• Vast majority of human experience is prehistoric
Formal vs. Natural Language
• Natural Language:
• Evolved organically among humans
• Flexible, ambiguous, context-dependent
• Formal Language:
• Designed for precision (e.g., math, programming)
• Rigid syntax, unambiguous semantics
• Key Differences:
• Ambiguity: Natural (yes), Formal (no)
• Redundancy: Natural (high), Formal (low)
• Context-dependence: Natural (yes), Formal (no)
• Tolerance for errors: Natural (yes), Formal (no)
Examples of formal and natural languages
• Natural Languages:
• English, Arabic, Mandarin, Farsi, French,
Swahili...
• Evolved through human interaction
• Formal Languages:
• Mathematical notation: ∑, ∫, ∀x ∈ X
• Programming languages: Python, Java, C++
• Logic: Propositional logic (e.g., ¬P ∨ Q), First-order logic
What is Natural Language Processing (NLP)?
NLP vs. CL
Why is NLP Hard?
• Ambiguity:
• Words and sentences often have multiple meanings
• Context-dependence:
• Meaning changes with context, tone, or world knowledge
• Variability and Creativity:
• Many ways to say the same thing
• Imprecision and Errors:
• Typos, slang, grammatical errors
• Hidden Structure:
• Syntax and semantics are not always explicit
• World Knowledge Required:
• Understanding language often needs common sense or real-world
knowledge
Ambiguity
Language is full of ambiguity at every level:
• Lexical:
"bank" (riverbank vs. financial institution)
• Syntactic:
"I saw the man with the telescope."
→ Who had the telescope?
• Semantic:
"Time flies like an arrow."
→ Is “flies” a verb or a noun?
Disambiguating meaning requires both context and
reasoning.
Context-Dependence
Words often change meaning based on:
• Situation
• Cultural norms
• Prior sentences
Example:
“He was late. She was furious.” → Furious
about what?
Understanding requires inference beyond the
text.
Variability and Creativity
Humans are endlessly creative in how they
express ideas:
• Idioms: “Kick the bucket”
• Sarcasm: “Oh great, another Monday!”
• Code-switching, dialects, and informal speech
all introduce variability that machines struggle
to interpret.
Errors and Informality
Real-world language includes:
• Spelling mistakes
• Incomplete sentences
• Slang and emojis
Traditional models expect clean input—real text
is often messy.
Hidden Structure
Grammar and meaning are often implicit, not
explicitly marked:
• Unlike programming languages, natural
language lacks fixed structure
• Parsing human sentences correctly is far from
trivial
Requires Real-World and Common-Sense
Knowledge
Understanding language often requires knowing things
about the world:
• “He put the ice cream in the oven.” → Seems strange,
but why?
• This kind of inference depends on commonsense
reasoning, something machines still struggle with.
Text Classification
• Definition: Automatically assigning predefined
categories or labels to a given text.
• Applications:
• Sentiment analysis: "I love this movie!" →
Positive
• Spam detection: "Congratulations, you've
won a prize!" → Spam
• Topic classification: "Global warming is a
critical issue." → Environment
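A text classifier can be as simple as a keyword lookup. The sketch below is a toy rule-based sentiment classifier; the cue-word lists are illustrative assumptions (real systems learn such signals from labeled data):

```python
# Toy rule-based sentiment classifier -- a minimal sketch, not a trained model.
# The keyword sets below are illustrative assumptions, not from a real lexicon.
POSITIVE = {"love", "great", "excellent", "wonderful"}
NEGATIVE = {"hate", "terrible", "awful", "boring"}

def classify_sentiment(text: str) -> str:
    """Label text Positive/Negative/Neutral by counting cue words."""
    words = {w.strip(".,!?").lower() for w in text.split()}
    pos, neg = len(words & POSITIVE), len(words & NEGATIVE)
    if pos > neg:
        return "Positive"
    if neg > pos:
        return "Negative"
    return "Neutral"

print(classify_sentiment("I love this movie!"))  # Positive
```

Learned classifiers (e.g., logistic regression over word counts) follow the same interface: text in, label out.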
Named Entity Recognition (NER)
• Definition: Identifying and classifying named entities in
text into predefined categories.
• Examples:
• "Barack Obama was born in Hawaii."
• [Barack Obama] → PERSON
• [Hawaii] → LOCATION
• "Apple announced the iPhone 15 in September."
• [Apple] → ORGANIZATION
• [iPhone 15] → PRODUCT
• [September] → DATE
Part-of-Speech (POS) Tagging
• Definition: Labeling each word in a sentence
with its corresponding part of speech.
• Example:
• Sentence: "The quick brown fox jumps over
the lazy dog."
• Tags: The/DET quick/ADJ brown/ADJ
fox/NOUN jumps/VERB over/ADP the/DET
lazy/ADJ dog/NOUN
Parsing
• Definition: Analyzing the grammatical structure of a
sentence.
• Types:
• Constituency Parsing: Breaks down the sentence into sub-phrases.
• Dependency Parsing: Shows grammatical relationships
between words.
• Example (Dependency):
• "She enjoys reading books."
• Subject: She → enjoys
• Object: books → reading → enjoys
Parsing
• Short-Term Dependency:
• Nearby words affect each other directly
• Example:
“The dog barked loudly.” → “dog” helps predict “barked”
• Long-Term Dependency:
• A word depends on another word far back in the
sentence or text
• Example:
“The book that the professor who the students admired
wrote was fascinating.”
→ “book” connects to “was fascinating”
Machine Translation
• Definition: Automatically translating text from one
language to another.
• Example:
• Input: "Hello, how are you?"
• Output (French): "Bonjour, comment ça va ?"
• Applications:
• Travel and tourism
• Global business communication
Question Answering (QA)
• Definition: Given a question, the system
provides a concise and correct answer.
• Types:
• Closed-domain: Specific topics (e.g., medical,
legal)
• Open-domain: General knowledge
• Example:
• Question: "What is the capital of Japan?"
• Answer: "Tokyo"
Text Summarization
• Definition: Producing a shorter version of a longer text
while retaining its main ideas.
• Types:
• Extractive: Select key sentences
• Abstractive: Generate summary using paraphrased
text
• Example:
• Input: "Artificial intelligence is transforming
industries across the globe..."
• Output: "AI is revolutionizing global industries."
Text Generation
• Definition: Automatically generating
coherent and contextually relevant text.
• Examples:
• Input prompt: "Write a story about a
robot that learns to love."
• Output: "Once upon a time, in a distant
future, a lonely robot named Aria..."
Coreference Resolution
• Definition: Determining when different expressions
refer to the same entity.
• Example:
• "Ava dropped her phone. She picked it up."
• "She" → Ava
• "it" → phone
• Importance:
• Essential for understanding continuity in text and
dialogue.
Word Representation in NLP
• Words must be converted to numbers for machines to
process
• Common types of word representations:
• One-hot (encoding) vectors: Binary, sparse, no semantic
similarity
• Word embeddings: Dense vectors capturing meaning
• Word2Vec, GloVe, FastText
• Contextual embeddings: Word meaning varies with context
• ELMo, BERT, GPT
One-Hot Encoding
• Each word is represented by a vector with all
zeros except a one in the position assigned to
that word.
• Example (vocabulary of 5 words):
• “cat” → [0, 1, 0, 0, 0]
• Limitations:
• High dimensional and sparse
• Doesn’t capture meaning or similarity
between words
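The encoding above can be sketched in a few lines, using the slide's five-word toy vocabulary:

```python
# One-hot encoding over a fixed 5-word vocabulary (sketch).
VOCAB = ["the", "cat", "sat", "on", "mat"]

def one_hot(word: str) -> list[int]:
    """Return a vector of zeros with a single 1 at the word's index."""
    vec = [0] * len(VOCAB)
    vec[VOCAB.index(word)] = 1  # raises ValueError for out-of-vocabulary words
    return vec

print(one_hot("cat"))  # [0, 1, 0, 0, 0]
```

With a real vocabulary of 100k+ words, each vector has 100k+ entries and exactly one 1, which is why this representation is called high-dimensional and sparse.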
Word Embeddings
• Words are mapped to dense, low-dimensional vectors
based on their usage in language.
• Words with similar meanings have similar vectors.
• Popular models:
• Word2Vec: Predicts a word from its context (or vice versa)
• GloVe: Combines local and global word co-occurrence
• FastText: Includes subword information for better handling of
rare words
• Example:
• vector("king") – vector("man") + vector("woman") ≈
vector("queen")
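The king − man + woman analogy can be reproduced with hand-crafted toy vectors. The two dimensions ("royalty" and "gender") are an illustrative assumption; real embeddings are learned and have hundreds of dimensions:

```python
import math

# Hand-crafted 2-d toy vectors: dim 0 = "royalty", dim 1 = "gender".
# Illustrative assumption only; real embeddings are learned from corpora.
E = {"king": [1.0, 1.0], "queen": [1.0, -1.0],
     "man":  [0.0, 1.0], "woman": [0.0, -1.0]}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def analogy(a, b, c):
    """Return the word nearest to vector(a) - vector(b) + vector(c)."""
    target = [x - y + z for x, y, z in zip(E[a], E[b], E[c])]
    candidates = {w: v for w, v in E.items() if w not in (a, b, c)}
    return max(candidates, key=lambda w: cosine(target, candidates[w]))

print(analogy("king", "man", "woman"))  # queen
```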
Contextual Embeddings
• Traditional embeddings assign one vector per word,
regardless of context.
• Contextual embeddings adapt based on the sentence.
• Example:
• “bank” in:
• “He sat by the river bank”
• “She went to the bank to deposit money”
• Models like ELMo, BERT, and GPT generate different
vectors for each use based on context.
Why word-representation matters?
• Word representations are the input layer for
most NLP models. Better representations lead
to:
• Improved semantic understanding
• More accurate predictions
• Robust performance across tasks
• As models have evolved, contextualized
representations have become the norm, especially
in systems based on transformers and foundation
models.
Text Data
• Usually natural language text (human generated)
• Unstructured or semi-structured
• Features or variables are tokens of text (characters,
words, n-grams, sentences, part of speech, named
entities, semantics, and so on) → terms
• Data set: Corpora (corpus)
• Represented by document-term matrix (n x m)
• A very big and sparse matrix
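A minimal sketch of building such a document-term matrix from a toy corpus (binary features; real corpora yield matrices with millions of mostly zero entries, hence sparse storage):

```python
# Building a small binary document-term matrix (sketch).
docs = ["the cat sat", "the dog barked", "the cat and the dog"]

# Vocabulary = sorted set of all tokens (the matrix columns / terms).
vocab = sorted({w for d in docs for w in d.split()})

# One row per document: 1 if the term occurs in it, else 0.
matrix = [[1 if term in d.split() else 0 for term in vocab] for d in docs]

print(vocab)
for row in matrix:
    print(row)
```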
Text (Document) Data Representation
• Lexical:
• Character
• Words
• Phrases
• Part-of-speech tags
• Syntactic:
• Vector-space model
• Language models
• Full vs. skeleton parsing
• Semantic:
• Collaborative tagging / Web2.0
• Templates / Frames
• Ontologies / First-order theories
• Taxonomies / thesauri
• Cross-modality: in multi-media data mining
Character Level
• Sequences of characters are extracted
• A document is represented by a frequency distribution
of sequences
• Each character sequence of length 1, 2, 3, … represents a
feature with its frequency
• Good and bad sides
• It is very robust since it avoids language morphology: useful
for language identification
• It captures simple patterns on character level: useful for spam
and plagiarism detection
• For deeper semantic tasks, the representation is too weak
Word level
• The most common representations: bag of words, word embeddings, etc.
• Important to know:
• "Word" is a well-defined unit in Western languages; Chinese, for example, has a different notion of the basic semantic unit
• Pre-processing steps:
• Tokenization
• Converting to lower case?
• Stemming?
• Stop-word removal?
• Band-pass filtering?
• Document length normalization
Relations between Words (1)
• Local relations:
• Co-occurrence and collocation: two words often appear together in documents (or in a specific category of documents)
• Counted over the whole document, or within a sequence / sliding window
• If we see "united" in a document, what is the probability that the next word is "nations", and what is the probability of "states"?
• We can learn these probabilities from a large collection of example documents.
WordNet example
• 26 relations, 116k senses
• (Figure: a fragment of the WordNet semantic network centred on "chicken", linking concepts such as hen, duck, goose, bird, animal, creature, egg, meat, feather, wing, beak, claw and leg through relations like Is_a, Part, Purpose, Typ_obj, Typ_subj, Caused_by and Means)
Word Frequency
• Word frequencies in texts have power law
distribution:
• Small number of very frequent words
• Big number of low frequency words
Zipf’s Law
• Even in a very large corpus, there will be a lot
of infrequent words
• The same holds for many other levels of
linguistic structure
• Core NLP challenge: we need to estimate
probabilities or to be able to make predictions
for things we have rarely or never seen
Stop-words
• Stop-words are words that, from a non-linguistic point of view, carry little information
• They play mainly functional (grammatical) roles
• We usually remove them to help the methods perform better
• Stop words are language dependent – examples:
• English: A, ABOUT, ABOVE, ACROSS, AFTER, AGAIN, AGAINST,
ALL, ALMOST, ALONE, ALONG, ALREADY
• Dutch: de, en, van, ik, te, dat, die, in, een, hij, het, niet, zijn, is,
was, op, aan, met, als, voor, had, er, maar, om, hem, dan, zou,
of, wat, mijn, men, dit, zo, ...
• Slovenian: A, AH, AHA, ALI, AMPAK, BAJE, BODISI, BOJDA,
BRŽKONE, BRŽČAS, BREZ, CELO, DA, DO, ...
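Stop-word removal is a simple set-membership filter; the list below is a tiny illustrative subset of the English stop-words above, not a complete stop-list:

```python
# Stop-word removal sketch; the set is a tiny illustrative subset.
STOP_WORDS = {"a", "about", "above", "the", "is", "of", "to", "in"}

def remove_stop_words(tokens):
    """Keep only tokens that are not stop-words (case-insensitive)."""
    return [t for t in tokens if t.lower() not in STOP_WORDS]

print(remove_stop_words(["The", "cat", "is", "on", "the", "mat"]))
# ['cat', 'on', 'mat']
```

Because stop-words are language dependent, a real pipeline would select the stop-list to match the document language.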
Stemming (1)
• Different forms of the same word are usually problematic for text data analysis, because they have different spellings but similar meanings (e.g. learns, learned, learning, …)
• Stemming is the process of transforming a word into its stem (normalized form)
• Stemming provides an inexpensive mechanism to merge different forms of the same word
• Online stemming
Stemming (2)
• For English we mostly use Porter stemmer
http://www.tartarus.org/~martin/PorterStemmer/
• Example cascade rules used in English Porter stemmer
• ATIONAL -> ATE relational -> relate
• TIONAL -> TION conditional -> condition
• ENCI -> ENCE valenci -> valence
• ANCI -> ANCE hesitanci -> hesitance
• IZER -> IZE digitizer -> digitize
• ABLI -> ABLE conformabli -> conformable
• ALLI -> AL radicalli -> radical
• ENTLI -> ENT differentli -> different
• ELI -> E vileli -> vile
• OUSLI -> OUS analogousli -> analogous
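The rule cascade above can be sketched as a first-match suffix rewrite. This toy version omits the Porter algorithm's stem-measure checks, so it only illustrates the rule idea, not the full stemmer:

```python
# A few Porter-stemmer cascade rules from the slide, applied as a simple
# first-match suffix rewrite. Rule order matters: "ational" must be tried
# before "tional". This is a sketch, not the full Porter algorithm.
RULES = [("ational", "ate"), ("tional", "tion"), ("enci", "ence"),
         ("anci", "ance"), ("izer", "ize"), ("abli", "able"),
         ("alli", "al"), ("entli", "ent"), ("eli", "e"), ("ousli", "ous")]

def apply_rules(word: str) -> str:
    """Rewrite the first matching suffix; return the word unchanged otherwise."""
    for suffix, repl in RULES:
        if word.endswith(suffix):
            return word[: -len(suffix)] + repl
    return word

print(apply_rules("relational"))   # relate
print(apply_rules("conditional"))  # condition
```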
Lemmatization
Stemming vs. lemmatization
Phrase Level
• Instead of having just single words we can deal
with phrases
• We use two types of phrases:
• Phrases as frequent contiguous word sequences
• Phrases as frequent non-contiguous word
sequences
• Both types of phrases can be identified by a simple dynamic-programming algorithm
• The main effect of using phrases is to more
precisely identify senses
Part-of-Speech level
• Introduces word-types to differentiate words functions
• For text-analysis part-of-speech information is used mainly for
“information extraction” where we are interested in e.g.
named entities which are “noun phrases”
• Another possible use is for feature reduction
• it is known that nouns carry most of the information in text
documents
TF-IDF weighting
• tfidf(w) = tf(w) · log( N / df(w) )
• tf(w) – term frequency (number of occurrences of the word in a document)
• df(w) – document frequency (number of documents containing the word)
• N – number of all documents
• tfidf(w) – relative importance of the word in the document
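The formula above in code, on a toy three-document corpus (a sketch; practical systems add smoothing and length normalization):

```python
import math

# TF-IDF sketch following the slide's formula: tfidf(w) = tf(w) * log(N / df(w)).
docs = [["the", "cat", "sat"],
        ["the", "dog", "barked"],
        ["the", "cat", "and", "the", "dog"]]
N = len(docs)

def tfidf(word: str, doc: list[str]) -> float:
    tf = doc.count(word)                    # occurrences in this document
    df = sum(1 for d in docs if word in d)  # documents containing the word
    return tf * math.log(N / df)

print(tfidf("the", docs[0]))  # 0.0 -- "the" appears in every document
print(round(tfidf("cat", docs[0]), 3))
```

Note how a word that occurs in every document gets weight 0: the idf factor suppresses uninformative terms like stop-words automatically.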
Text Data Example (4)
        T1   T2   T3   T4   ……   Tn
D1       1    0    0    0         1
D2       0    1    0    1         0
D3       0    0    0    1         0
D4       1    0    1    0         1
…
Dm       0    1    1    0         1
(rows: documents, columns: terms; binary features)
Text Data Example (7)
        T1   T2   T3   T4   ……   Tn
D1      0.5  0.1  0.1  0.0        0.7
D2      0.4  1.0  0.2  0.9        0.2
D3      0.1  0.1  0.0  1.0        0.3
D4      0.8  0.0  0.7  0.0        0.6
…
Dm      0.2  1.0  0.9  0.0        0.8
(rows: documents, columns: terms; weighted features, e.g. TF-IDF)
Main Objective in NLP — Language Modeling
Language modeling: Statistical Language Models
(SLM)
• Theoretical background: Based on Markov
Assumption
• The Markov assumption: the probability of
an event occurring in the future (in a
sequence) depends only on a finite, fixed
number of past events, rather than the
entire history.
• In other words, it assumes that the future is
conditionally independent of the past,
given a limited "context window" of
previous events.
Language modeling: Statistical Language Models
(SLM)
• The level of this assumption is often referred to as the
"order" of the Markov model:
• First-order Markov Model (Markov Chain): The probability of an event depends only on the immediately preceding event.
• Higher-order Markov Models: The probability of an event depends on a fixed number of preceding events. For example, in a second-order Markov model, the probability of an event depends on the two preceding events, and so on.
Language modeling: Neural Language Models
(NLM)
• Theoretical background (linguistics): Distributional representation (John Rupert Firth, 1957)
• The distributional hypothesis posits that words that
occur in similar contexts tend to have similar
meanings. In other words, words that are used in
similar ways in text are likely to be related in
meaning.
What is a Language Model?
• A language model estimates the probability of a
sequence of words
• Example: P("The cat sat on the mat") > P("Cat the mat on sat
the")
• Purpose: Capture patterns and structure in human language
• Types:
• Probabilistic: N-gram models
• Neural: RNN, LSTM, Transformer-based models (e.g., GPT)
Language modeling
• Goal: Assigning probability to a sequence of words
• For text understanding:
p("The cat is on the mat") > p("Truck the earth on")
Modern NLP: Everything as a Text-to-Text
Task
• Modern NLP uses foundation models (e.g., GPT, T5, PaLM)
• Core idea: Convert any NLP task into text-to-text format
• Examples:
• Sentiment Analysis:
The movie’s closing scene is attractive; it was → "good"
• Machine Translation:
“Hello world” in French is → "Bonjour le monde"
• Question Answering:
Which city is UVA located in? → "Charlottesville"
All are reduced to language modeling problems
Probabilistic Language Modeling —
Autoregressive Assumption
N-grams in Language Modeling
• N-gram = sequence of n words
• Approximates language using the Markov
assumption:
• A word depends only on the last (n−1) words
• Common examples:
• Unigram: P(w₁)
• Bigram: P(w₂ | w₁)
• Trigram: P(w₃ | w₁, w₂)
• Trade-off:
• Higher n = better context, but needs more data
N-grams in Language Modeling
An N-gram is a simple yet powerful model used to estimate the probability
of a word based on the (n–1) previous words. This approach uses the
Markov assumption, which simplifies the complex dependencies in
language by assuming that the current word is only influenced by a limited
history.
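A minimal maximum-likelihood trigram estimator over a toy corpus, illustrating the count-based probability P(w₃ | w₁, w₂) = count(w₁ w₂ w₃) / count(w₁ w₂):

```python
from collections import Counter

# Maximum-likelihood trigram probabilities from raw counts (sketch):
#   P(w3 | w1, w2) = count(w1 w2 w3) / count(w1 w2)
corpus = "the cat sat on the mat the cat ran".split()

trigrams = Counter(zip(corpus, corpus[1:], corpus[2:]))
bigrams = Counter(zip(corpus, corpus[1:]))

def p(w3, w1, w2):
    """MLE probability of w3 following the bigram (w1, w2)."""
    return trigrams[(w1, w2, w3)] / bigrams[(w1, w2)]

print(p("sat", "the", "cat"))  # 0.5 -- "the cat" is followed once by "sat", once by "ran"
```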
Example of Trigram (n=3) Language Model
M. Makrehchi 73
Example of Trigram (n=3) Language Model
M. Makrehchi 74
Example of Trigram (n=3) Language Model
M. Makrehchi 75
Example of Trigram (n=3) Language Model
For trigrams:
M. Makrehchi 76
N-Gram generative models
• Language modeling is about determining the probability of a sequence of words
• The task is typically reduced to estimating the probability of the next word given the two previous words (trigram model):
• P(w₃ | w₁, w₂) = count(w₁ w₂ w₃) / count(w₁ w₂), estimated from frequencies of word sequences in a corpus
N-grams in Language Modeling
• Trade-offs:
• Smaller n (e.g., unigram, bigram):
• Less memory and faster training
• Poor understanding of context
N-grams in Language Modeling
• Limitations:
• Can't capture long-range dependencies
• Fails when the sequence has not been seen in training (zero
probabilities)
• Needs smoothing techniques (e.g., Laplace, Kneser-Ney) to adjust
for unseen sequences
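Laplace (add-one) smoothing, the simplest of the techniques mentioned above, sketched for a bigram model on a toy corpus:

```python
from collections import Counter

# Laplace (add-one) smoothed bigram probability (sketch):
#   P(w2 | w1) = (count(w1 w2) + 1) / (count(w1) + V),  V = vocabulary size
corpus = "the cat sat on the mat".split()
unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))
V = len(unigrams)

def p_laplace(w2, w1):
    """Smoothed probability: unseen bigrams get a small nonzero mass."""
    return (bigrams[(w1, w2)] + 1) / (unigrams[w1] + V)

print(p_laplace("cat", "the"))  # seen bigram: relatively high
print(p_laplace("dog", "the"))  # unseen bigram: small but nonzero
```

Without the +1, p("dog" | "the") would be exactly 0 and any sentence containing it would get probability 0; that is the zero-probability problem smoothing solves.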
N-Gram generative models
• Suppose we have a language model that gives us the estimate of P(w | x₁, x₂, …, xₙ); we can then generate the next tokens one by one:
1. Sampling: xᵢ ~ P(w | x₁, x₂, …, xₙ)
2. Greedy: xᵢ = argmax_w P(w | x₁, x₂, …, xₙ)
Softmax: Probability Calculation
Softmax is a mathematical function used to normalize scores into a valid probability distribution:
Given scores (or logits) → Softmax → Probabilities (sum to 1)
For n-gram models:
• You don't usually need softmax explicitly, because probabilities are computed directly from counts.
• Neural language models use softmax to turn raw scores into probabilities.
Softmax is used for:
• Training (predicting the next word)
• Evaluating how likely a sentence is (probabilistic modeling)
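A minimal softmax implementation, with the standard max-subtraction trick for numerical stability:

```python
import math

# Softmax: turn arbitrary scores (logits) into a probability distribution.
def softmax(scores):
    m = max(scores)  # subtracting the max avoids overflow in exp()
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])
print(probs)       # highest probability for the largest score
print(sum(probs))  # 1.0 (up to floating-point rounding)
```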
Sampling: Generation Step
• Sampling is what you do after you have probabilities to actually
pick the next word.
• You have two options:
• Greedy: Always choose the word with the highest probability.
• Leads to repetitive, sometimes dull text.
• Sampling:
• Treat the probabilities as a distribution.
• Randomly select the next word according to its probability.
• This introduces variation, creativity, and diversity in
generated text.
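Both strategies in a sketch, over an assumed toy next-word distribution:

```python
import random

# Next-word selection from a probability distribution (sketch).
# The distribution below is an illustrative assumption, not model output.
dist = {"mat": 0.6, "sofa": 0.3, "moon": 0.1}

def greedy(dist):
    """Always pick the most probable word (deterministic, can be repetitive)."""
    return max(dist, key=dist.get)

def sample(dist, rng=random):
    """Pick a word at random in proportion to its probability."""
    words, probs = zip(*dist.items())
    return rng.choices(words, weights=probs, k=1)[0]

print(greedy(dist))  # mat, every time
print(sample(dist, random.Random(0)))  # varies with the seed
```

Practical systems interpolate between these extremes, e.g. by sampling from a temperature-scaled or top-k truncated distribution.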
N-Gram generative models
• Recursively sample xᵢ ~ P(w | x₁, x₂, …, xᵢ₋₁) until we generate [EOS]
• Generate the first word: "the"  x₁ ~ p(w | [BOS])
• Generate the second word: "cat"  x₂ ~ p(w | "the")
• Generate the third word: "is"  x₃ ~ p(w | "the cat")
• Generate the fourth word: "on"  x₄ ~ p(w | "the cat is")
• Generate the fifth word: "the"  x₅ ~ p(w | "the cat is on")
• Generate the sixth word: "mat"  x₆ ~ p(w | "the cat is on the")
• Generate the seventh word: "[EOS]"  x₇ ~ p(w | "the cat is on the mat")
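The generation loop above can be sketched with a toy lookup table playing the role of the trigram model; the table is an illustrative assumption, not learned from data:

```python
# Autoregressive generation sketch: repeatedly pick the most likely next
# word given the last two tokens (trigram-style context) until [EOS].
# The lookup table is an illustrative assumption, not a trained model.
NEXT = {("[BOS]",): "the",
        ("[BOS]", "the"): "cat",
        ("the", "cat"): "is",
        ("cat", "is"): "on",
        ("is", "on"): "the",
        ("on", "the"): "mat",
        ("the", "mat"): "[EOS]"}

def generate():
    tokens = ["[BOS]"]
    while True:
        context = tuple(tokens[-2:])  # condition on the last two tokens
        w = NEXT[context]
        if w == "[EOS]":
            return tokens[1:]  # drop the [BOS] marker
        tokens.append(w)

print(" ".join(generate()))  # the cat is on the mat
```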
Google N-Gram
• Some statistics of the corpus:
• File sizes: approx. 24 GB compressed (gzip'ed)
text files
• Number of tokens: 1,024,908,267,229
• Number of sentences: 95,119,665,584
• Number of unigrams: 13,588,391
• Number of bigrams: 314,843,401
• Number of trigrams: 977,069,902
• Number of fourgrams: 1,313,818,354
• Number of fivegrams: 1,176,470,663
History of Language Models — The Four Eras

Era        | Paradigm                       | Key Models           | Characteristics
< 2000     | Probabilistic models           | N-gram, HMM          | Rule-based, sparse, short memory
2000–2018  | Shallow neural models          | Word2Vec, RNN, GloVe | Embeddings, sequential models
2018–2022  | Small pre-trained transformers | BERT, GPT-2, T5      | Pre-training + fine-tuning
2022+      | Large language models          | GPT-3.5, GPT-4, PaLM | Massive scale, general-purpose LLMs