NLP Final Notes
In Natural Language Processing (NLP), ambiguity refers to situations where a word, phrase, or
sentence has multiple possible interpretations. Ambiguity can arise at different levels of
language processing, each presenting unique challenges for machines to resolve. Here, we'll
explain the types of ambiguity associated with each level of NLP, along with examples.
1. Lexical Ambiguity (Word Level)
2. Syntactic Ambiguity (Sentence Structure Level)
3. Semantic Ambiguity (Meaning Level)
4. Anaphoric Ambiguity
5. Discourse Ambiguity (Multiple Sentences)
1. Lexical Ambiguity (Word Level)
Definition: Lexical ambiguity occurs when a single word has multiple meanings (polysemy), and the correct meaning depends on the context in which the word is used.
Example:
● Sentence: "He went to the bank to fish."
Resolution: The surrounding context (e.g., the word "fish") helps disambiguate the meaning of "bank" as a riverbank.
2. Syntactic Ambiguity (Sentence Structure Level)
Definition: Syntactic ambiguity arises when a sentence can be parsed in more than one way due to its syntactic structure, often because of word order or the lack of punctuation.
Example:
● Sentence: "I saw the man with the telescope."
Resolution: The ambiguity arises because the phrase "with the telescope" can be interpreted either as a modifier of "man" (describing the man) or as a prepositional phrase indicating the instrument used for seeing.
Example:
. Interpretation 1: She was reading a book while physicallylocatedon the beach.
1
2. Interpretation 2: She was reading a book aboutbeaches.
4. Anaphoric Ambiguity
Definition: Anaphoric ambiguity arises when a pronoun or another referring expression (like "he," "she," "it," "they," etc.) refers to something that could be understood in multiple ways. The ambiguity occurs due to a lack of clarity about what the pronoun refers to.
Example:
5. Pragmatic Ambiguity
Definition: Pragmatic ambiguity occurs when the meaning of a sentence or phrase depends heavily on the context of the conversation, the speaker's intention, or the broader situational context.
Example:
● Sentence: "Can you pass the salt?"
● Interpretation 1: The speaker is asking whether the listener is able to pass the salt (a literal question).
● Interpretation 2: The speaker is requesting the listener to pass the salt (a polite command or request).
Resolution: The context of the situation resolves the ambiguity. In a dinner setting, the phrase is typically interpreted as a polite request rather than a question about the listener's ability.
MODULE 2
Affixes and Their Types
An affix is a morpheme (the smallest meaningful unit of language) that is attached to a word to modify its meaning or grammatical function. Affixes are not words on their own and must combine with root words, stems, or bases to form a complete word. They are fundamental to morphology, the study of word structure.
Types of Affixes
Affixes can be categorized based on their position relative to the root or stem of a word and their function. The major types include:
1. Prefixes
● Definition: Affixes that are added to the beginning of a root or stem.
● Function: Often change the meaning of the word but rarely its grammatical category.
● Examples:
○ "Un-" in unhappy (negation)
○ "Re-" in rewrite (repetition)
○ "Pre-" in preview (time or order)
2. Suffixes
● Definition: Affixes that are attached to the end of a root or stem.
● Function: Can change the grammatical category of the word (e.g., noun to adjective) or its tense, number, case, etc.
● Examples:
○ "-ed" in walked (past tense)
○ "-ness" in kindness (noun formation)
○ "-ly" in quickly (adverb formation)
3. Infixes
● Definition: Affixes that are inserted within a root or stem.
● Function: Rare in English but common in other languages (e.g., Tagalog, Arabic).
● Examples:
○ In Tagalog: sulat (write) → sumulat (to write).
○ In informal English: abso-bloody-lutely (infixation for emphasis).
Output of Morphological Analysis
Morphological analysis decomposes a word into its components (morphemes) and identifies its base form and grammatical features. Below are examples of morphological analysis for different word categories:
Word: walked
Output:
● Root: walk
● Tense: past
● Verb Type: regular

Word: went
Output:
● Root: go
● Tense: past
● Verb Type: irregular

Word: cat
Output:
● Root: cat
● Number: singular
● Part of Speech: noun

Word: cats
Output:
● Root: cat
● Number: plural
● Part of Speech: noun
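As a rough illustration of what a morphological analyzer does, the following minimal Python sketch produces the kind of output shown above from a hand-made lookup table and two suffix rules. The table is invented for illustration only; real analyzers use FSTs or trained models.

```python
# A minimal sketch of dictionary-based morphological analysis.
# The lexicon below is a made-up toy table, not a real analyzer.

IRREGULAR = {"went": ("go", {"Tense": "past", "Verb Type": "irregular"})}
NOUNS = {"cat", "dog"}

def analyze(word: str) -> dict:
    if word in IRREGULAR:                        # listed irregular verbs
        root, feats = IRREGULAR[word]
        return {"Root": root, **feats}
    if word.endswith("ed"):                      # regular past-tense verb
        return {"Root": word[:-2], "Tense": "past", "Verb Type": "regular"}
    if word.endswith("s") and word[:-1] in NOUNS:  # plural noun
        return {"Root": word[:-1], "Number": "plural", "Part of Speech": "noun"}
    if word in NOUNS:                            # singular noun
        return {"Root": word, "Number": "singular", "Part of Speech": "noun"}
    return {"Root": word}                        # fallback: no analysis

for w in ["walked", "went", "cat", "cats"]:
    print(w, "->", analyze(w))
```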
Role of Finite State Transducer (FST) in Morphological Parsing
A Finite State Transducer (FST) is a computational model used for morphological parsing. It maps input strings (surface forms) to their respective lexical forms (base/root words and features) using states and transitions. FSTs are widely used because of their efficiency and ability to represent complex linguistic rules.
Input Word: cats
FST Representation:
1. States:
○ q0: start state.
○ q1: root word state.
○ q2: suffix state.
2. Transitions:
○ q0 → q1 on "cat": extract the root "cat".
○ q1 → q2 on "s", emitting +PLURAL: identify the suffix "s" and its feature (plural).
3. Output:
○ Root: cat
○ Number: plural
Advantages of FSTs:
1. Efficiency: Processes words quickly with low computational overhead.
2. Modularity: Easily integrates with other linguistic processing tasks.
3. Rule-Based Flexibility: Can handle complex morphological rules (e.g., regular and irregular inflections).
4. Reversible: Can generate surface forms from lexical forms.
Example of a Complete FST Process
Word: running
1. Input: running
2. Steps in Parsing:
○ Detect run as the root word.
○ Detect -ing as a suffix indicating the present participle.
3. Output:
○ Root: run
○ Grammatical Features: progressive
FST Representation:
1. q0 → q1 on "run": root extraction.
2. q1 → q2 on "ing", emitting +PROGRESSIVE: suffix analysis.
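A minimal Python sketch of the FST idea above, hand-coding the root and suffix transitions for "cats" and "running". This is a toy illustration, not a full two-level morphology implementation.

```python
# A minimal sketch of an FST-style parser for the two words discussed above.
# States and transitions are hand-coded toy rules.

def fst_parse(word: str):
    # q0 -> q1: strip a known root from the surface form.
    for root in ("cat", "run"):
        if word.startswith(root):
            rest = word[len(root):]
            # q1 -> q2: map the remaining suffix to a lexical feature.
            if rest == "s":
                return {"Root": root, "Feature": "+PLURAL"}
            if rest in ("ning", "ing"):          # "running" doubles the n
                return {"Root": root, "Feature": "+PROGRESSIVE"}
            if rest == "":
                return {"Root": root, "Feature": None}
    return None                                  # reject: no valid path

print(fst_parse("cats"))     # {'Root': 'cat', 'Feature': '+PLURAL'}
print(fst_parse("running"))  # {'Root': 'run', 'Feature': '+PROGRESSIVE'}
```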
A Finite State Automaton (FSA) is defined by the following components:
● States: Represent conditions or stages (e.g., start, intermediate, and accept states).
● Transitions: Represent the movement between states triggered by input symbols (e.g., letters or words).
● Alphabet: The set of symbols the FSA operates on (e.g., characters or words).
● Start State: The state where processing begins.
● Accept/Final States: States where the FSA concludes after recognizing valid patterns.
Nouns in English can be singular or plural, and they often follow specific patterns. FSAs for nouns can recognize these patterns.
● Alphabet: {boy, girl, book, cats, dogs, ...}
● States:
○ q0: start state.
○ q1: intermediate state (recognizing singular nouns).
○ q2: intermediate state (recognizing plural nouns).
○ qf: accept state.
Transition Rules:
1. From q0, if a singular noun (e.g., "boy") is read, move to q1.
2. From q0, if a plural noun (e.g., "cats") is read, move to q2.
3. From q1 or q2, move to qf.
Verbs can occur in different tenses and forms (e.g., base, past, present participle) and follow specific syntactic rules. FSAs can help identify these forms.
● Alphabet: {run, runs, running, ran, walk, walked, walking, walks, ...}
● States:
○ q0: start state.
○ q1: base form (e.g., "run").
○ q2: third person singular (e.g., "runs").
○ q3: past form (e.g., "ran").
○ q4: present participle (e.g., "running").
○ qf: accept state.
Transition Rules:
1. From q0, if the input is "run" or "walk," move to q1.
2. From q0, if the input is "runs" or "walks," move to q2.
3. From q0, if the input is "ran" or "walked," move to q3.
4. From q0, if the input is "running" or "walking," move to q4.
5. Any of q1–q4 can transition to qf upon recognizing a valid verb form.
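The noun and verb FSAs described above can be sketched as a small transition table in Python. The word lists are illustrative only; a real system would classify inputs with a lexicon or a tagger.

```python
# A minimal sketch of the noun/verb FSAs above, encoded as a transition table.

TRANSITIONS = {
    # (state, input class) -> next state; any of these states may reach qf.
    ("q0", "singular_noun"): "q1",
    ("q0", "plural_noun"): "q2",
    ("q0", "base_verb"): "q1v",
    ("q0", "third_person_verb"): "q2v",
    ("q0", "past_verb"): "q3v",
    ("q0", "participle_verb"): "q4v",
}

CLASSES = {
    "boy": "singular_noun", "girl": "singular_noun",
    "cats": "plural_noun", "dogs": "plural_noun",
    "run": "base_verb", "walk": "base_verb",
    "runs": "third_person_verb", "walks": "third_person_verb",
    "ran": "past_verb", "walked": "past_verb",
    "running": "participle_verb", "walking": "participle_verb",
}

def accepts(word: str) -> bool:
    """Return True if the FSA reaches the accept state qf for this word."""
    cls = CLASSES.get(word)
    state = TRANSITIONS.get(("q0", cls)) if cls else None
    return state is not None      # any intermediate state may move on to qf

for w in ["boy", "cats", "running", "table"]:
    print(w, "->", "accepted" if accepts(w) else "rejected")
```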
4. Combined FSA for Nouns and Verbs
● Use separate paths in the automaton for noun and verb recognition.
● Add transitions between noun-related states and verb-related states to model sentences like "The dog runs."
Example:
● FSAs are simple and efficient for recognizing patterns like noun and verb forms.
● They are useful for designing basic parsers in NLP.
● They help detect valid sequences in input, making them ideal for syntax checking.
However, FSAs have limitations: they cannot handle nested dependencies or context-sensitive structures. For more complex languages, pushdown automata (with a stack) or other models are used.
Open-Class Words
Open-class words are content words that convey the primary meaning in a sentence. These categories are flexible, and new words are frequently added to them, especially as language evolves or new concepts arise.
Characteristics:
1. Semantic Role: They carry the main meaning or "content" of a sentence.
2. Word Classes:
○ Nouns: Names of people, places, things, or ideas (e.g., dog, city, happiness).
○ Verbs: Actions, states, or processes (e.g., run, eat, believe).
○ Adjectives: Describe or modify nouns (e.g., blue, happy, fast).
○ Adverbs: Modify verbs, adjectives, or other adverbs (e.g., quickly, very, always).
3. Dynamic Nature:
○ New words are easily added, such as slang, borrowed terms, or technical jargon (e.g., selfie, blog, emoji).
Examples:
● Nouns: computer, teacher, innovation
● Verbs: explore, innovate, dance
● Adjectives: intelligent, red, efficient
● Adverbs: silently, joyfully, often
Closed-Class Words
Closed-class words are functional words that serve grammatical or structural roles in a sentence. Their categories are stable, with few or no new additions over time.
Characteristics:
Examples:
● Pronouns: you, we, them
● Prepositions: with, before, between
● Conjunctions: or, while, although
● Determiners: this, that, several
MODULE 3
Challenges of POS Tagging
Context Features:
● The word itself.
● Previous word and tag.
● Next word.
● Capitalization, suffixes, prefixes, etc.
Training Data Example:
sleeps → VBZ (features: Suffix="s", PreviousTag=NN, IsVerb=true)
MaxEnt Output:
For each word, compute P(t | x) for all possible tags t and assign the tag with the highest probability.
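A minimal sketch of this idea using scikit-learn's multinomial logistic regression (equivalent to a MaxEnt classifier) over sparse feature dictionaries; the tiny training set below is invented for illustration.

```python
# A minimal sketch of MaxEnt-style POS tagging: multinomial logistic
# regression over overlapping, sparse feature dictionaries.
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

train_features = [
    {"word": "sleeps", "suffix": "s", "prev_tag": "NN"},
    {"word": "dog",    "suffix": "g", "prev_tag": "DT"},
    {"word": "the",    "suffix": "e", "prev_tag": "<S>"},
    {"word": "runs",   "suffix": "s", "prev_tag": "NN"},
]
train_tags = ["VBZ", "NN", "DT", "VBZ"]

vec = DictVectorizer()
X = vec.fit_transform(train_features)            # overlapping features are fine
clf = LogisticRegression(max_iter=1000).fit(X, train_tags)

# For a new word, compute P(t | x) for every tag and pick the most probable.
x = vec.transform([{"word": "walks", "suffix": "s", "prev_tag": "NN"}])
probs = dict(zip(clf.classes_, clf.predict_proba(x)[0]))
print(probs, "->", clf.predict(x)[0])
```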
Advantages
1. Feature-Rich:
○ Can incorporate diverse and overlapping features.
○ Handles complex dependencies between features effectively.
2. No Independence Assumptions:
○ Unlike models like Naive Bayes, it does not assume feature independence.
3. Robustness:
○ Can model various distributions without overfitting if features are chosen
carefully.
4. Flexibility:
○ Works with various types of features, including categorical, binary, and
real-valued features.
Disadvantages
A Conditional Random Field (CRF) is a statistical modeling method used for structured prediction tasks in natural language processing (NLP). CRFs are specifically designed to model the relationships between adjacent elements in a sequence while considering the entire input context. They are widely used for tasks such as sequence labeling, POS tagging, named entity recognition (NER), and shallow parsing.
Top-Down Parser
Definition
A top-down parser starts with the highest-level grammar rule (the root of the parse tree, typically the start symbol S) and tries to rewrite it into the input sentence using grammar rules. It works by breaking down the sentence into smaller components based on the grammar.
Working Mechanism
1. Initialization:
○ Begin with the start symbol S.
2. Expansion:
○ Use grammar rules to expand S into its possible derivations.
○ Continue expanding non-terminal symbols (e.g., NP, VP) until the leaves of the tree match the input sentence.
3. Matching:
○ Check if the generated parse tree matches the input sentence. If not,
backtrack and try another rule.
4. Output:
○ Return the parse tree if a match is found, or report failure if no valid tree
can be constructed.
Example
1. S → NP VP
2. NP → Det N
3. VP → V NP
4. Det → the
5. N → dog
6. V → chased
Steps:
1. Start with S.
2. Apply S → NP VP.
3. Expand NP → Det N and VP → V NP.
4. Further expand Det → the, N → dog, and V → chased.
5. Match these expansions to the input sentence.
Parse Tree:
(S
  (NP (Det the) (N dog))
  (VP (V chased)
      (NP (Det the) (N cat))))
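For reference, NLTK provides a recursive-descent parser that performs exactly this kind of top-down, backtracking search; the sketch below runs it on the toy grammar above (assumes NLTK is installed).

```python
# A minimal sketch of top-down parsing with NLTK's recursive-descent parser.
import nltk

grammar = nltk.CFG.fromstring("""
S -> NP VP
NP -> Det N
VP -> V NP
Det -> 'the'
N -> 'dog' | 'cat'
V -> 'chased'
""")

parser = nltk.RecursiveDescentParser(grammar)   # top-down, with backtracking
for tree in parser.parse("the dog chased the cat".split()):
    print(tree)
# (S (NP (Det the) (N dog)) (VP (V chased) (NP (Det the) (N cat))))
# Note: this parser loops on left-recursive rules, matching the limitation below.
```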
Advantages
● Predictive: Can predict what structure is expected next.
● Logical Flow: Mirrors human thought processes in grammar analysis.
Disadvantages
● Backtracking: May need to try multiple rules if the initial ones fail.
● Inefficiency: Can generate irrelevant or invalid intermediate structures.
● Left Recursion: Struggles with grammars containing left-recursive rules, e.g., A → Aα.
Bottom-Up Parser
Definition
A bottom-up parser starts with the input sentence and attempts to construct the parse tree by gradually combining smaller components into larger structures, eventually reaching the start symbol S.
Working Mechanism
1. Initialization:
○ Begin with the words in the input sentence.
2. Reduction:
○ Identify components (e.g., Det, N) in the input and apply grammar rules in reverse to combine them into higher-level constituents.
3. Validation:
○ Check whether combining components leads to the start symbol S.
4. Output:
○ Return the parse tree if the start symbol S is reached, or report failure.
Example
1. S → NP VP
2. NP → Det N
3. VP → V NP
4. Det → the
5. N → dog
6. V → chased
Steps:
1. Start with the words: ["The", "dog", "chased", "the", "cat"].
2. Combine "The" and "dog" using NP → Det N.
3. Recognize "chased the cat" as VP → V NP.
4. Combine NP and VP into S → NP VP.
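NLTK's shift-reduce parser illustrates this bottom-up process on the same toy grammar (assumes NLTK is installed; this simple parser does not backtrack, so it can fail on grammars where a greedy reduction is wrong).

```python
# A minimal sketch of bottom-up (shift-reduce) parsing with NLTK.
import nltk

grammar = nltk.CFG.fromstring("""
S -> NP VP
NP -> Det N
VP -> V NP
Det -> 'the'
N -> 'dog' | 'cat'
V -> 'chased'
""")

parser = nltk.ShiftReduceParser(grammar)        # shifts words, reduces to S
for tree in parser.parse("the dog chased the cat".split()):
    print(tree)
```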
Advantages
● Efficiency: Avoids unnecessary derivations by only working with valid structures.
● Handles Left Recursion: Not affected by left-recursive grammar rules.
Disadvantages
● Non-Predictive: Cannot predict the next component to process.
● Complexity: May combine components incorrectly, leading to invalid intermediate structures.
Comparison: Top-Down vs Bottom-Up
Aspect Top-Down Parser Bottom-Up Parser
Starting Point egins with the start symbol Begins with the input sentence.
B
SSS.
andling Left
H annot handle left-recursive Handles left recursion efficiently.
C
Recursion grammars.
● Definition: Identifying which sense of a word is used in a given context, especially for words that have multiple meanings.
● Example: The word "bank" can refer to:
○ A financial institution: I deposited money in the bank.
○ The side of a river: The boat docked at the bank of the river.
● Challenge: Correctly choosing the appropriate meaning of ambiguous words based on the surrounding context.
● Definition: Representing words as dense vectors of real numbers that capture their semantic meaning and relationships with other words. These representations are learned from large corpora.
● Example:
○ In Word2Vec, king - man + woman ≈ queen, illustrating how word embeddings capture relationships between words.
● Challenge: Capturing complex, high-dimensional relationships and making them interpretable for downstream tasks.
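The analogy above can be reproduced with pretrained vectors via gensim, as in the sketch below. The model name is an assumption about what gensim's downloader provides, and the vectors are downloaded on first use.

```python
# A minimal sketch of the king - man + woman analogy with pretrained vectors.
# Assumes: pip install gensim; the named dataset downloads on first use.
import gensim.downloader as api

vectors = api.load("glove-wiki-gigaword-50")    # pretrained 50-d GloVe vectors
print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=3))
# 'queen' is expected to appear near the top of the returned list.
```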
● Definition: Identifying which words in a text refer to the same entity, such as pronouns referring to their antecedents.
● Example:
○ Sentence: John went to the store. He bought some apples.
○ Coreference resolution: "He" refers to "John."
● Challenge: Correctly linking pronouns and noun phrases in longer, more complex texts.
Semantic analysis has a wide range of applications across various NLP tasks. Some of the most prominent applications are:
● Application: Translates text from one language to another while preserving the meaning, context, and relationships between entities.
● Example: Translating "She is going to the market" from English to Spanish ensures that the correct meaning and grammatical structure are maintained in the target language.
● Application: Analyzes social media, reviews, and feedback to determine customer sentiments (positive, negative, neutral) towards a product, service, or brand.
● Example: Companies can use sentiment analysis to gauge public opinion on their products by analyzing online reviews and social media posts.
● Application: Generates concise and informative summaries of larger texts while retaining the meaning of the original content.
● Example: A system can automatically summarize a lengthy news article into a short paragraph that captures the main points.
● Application: Categorizes text into predefined classes based on its content, such as spam detection, topic categorization, and sentiment categorization.
● Example: Classifying an email as "spam" or "not spam" based on its semantic content.
2. Difference between Syntactic and Lexical Ambiguity

Aspect: Example
● Syntactic Ambiguity: "I saw the man with the telescope."
● Lexical Ambiguity: "Bank" (financial institution vs. riverbank).

Aspect: Resolution Techniques
● Syntactic Ambiguity: Syntax tree parsing, dependency parsing.
● Lexical Ambiguity: Word sense disambiguation (WSD), contextual analysis.

Aspect: Common NLP Tasks Affected
● Syntactic Ambiguity: Sentence parsing, machine translation, and question answering.
● Lexical Ambiguity: Named Entity Recognition (NER), text classification, and machine translation.

Aspect: Example of Ambiguity
● Syntactic Ambiguity: "The chicken is ready to eat." (Is the chicken going to eat, or is the chicken ready to be eaten?)
● Lexical Ambiguity: "Bark" (the sound a dog makes vs. the outer covering of a tree).
3. Demonstrate lexical semantic analysis using an example
Sentence:
“He went to the bank to fish.”
Analysis Process:
● Tokenization: Break the sentence into individual words: ["He", "went", "to", "the", "bank", "to", "fish"].
● Word Sense Disambiguation (WSD): The word "bank" has two senses, but the presence of the word "fish" indicates that the correct sense is the riverbank (side of a river).
● Contextual Clue: The word "fish" suggests an action that typically happens at the side of a river, not in a financial institution.
● Final Meaning: The word "bank" here refers to the riverbank.
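A small sketch of the WSD step using WordNet through NLTK: listing the noun senses of "bank" so that a Lesk-style method can compare each gloss with context words such as "fish" (assumes NLTK and its WordNet data are installed).

```python
# A minimal sketch of inspecting WordNet senses for the WSD step above.
# Assumes: pip install nltk and nltk.download('wordnet').
from nltk.corpus import wordnet as wn

for synset in wn.synsets("bank", pos=wn.NOUN):
    print(synset.name(), "->", synset.definition())
# In WordNet 3.x, bank.n.01 ("sloping land ... beside a body of water") is the
# riverbank reading selected here, while bank.n.02 is the financial institution.
```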
Lexical semantic analysis is the process of analyzing and understanding the meanings of words in a language, particularly focusing on how they are used in context. This involves:
For example, in the sentence "I went to the bank to fish," lexical semantics focuses on determining that "bank" in this context refers to the side of a river, not a financial institution. This involves examining the surrounding context, such as the word "fish," which serves as a clue for disambiguation.
a. Word Sense Disambiguation (WSD)
● Definition: WSD is the task of identifying the correct meaning of a word that has multiple meanings. It is essential for disambiguating words that have more than one possible interpretation.
● Example: In the sentence "He went to the bank to fish," the word "bank" has multiple meanings. WSD helps us identify that in this context, "bank" refers to the riverbank, not a financial institution.
b. Semantic Role Labeling (SRL)
● Definition: SRL is the process of identifying the roles that words or phrases play in a sentence (e.g., agent, theme, goal). It helps determine "who did what to whom" in a sentence.
● Example: In the sentence "The dog chased the ball," SRL would identify:
○ Agent: The dog (who performed the action)
○ Theme: The ball (what is being acted upon)
c. Named Entity Recognition (NER)
● Definition: NER involves identifying and classifying proper nouns into predefined categories, such as persons, organizations, locations, dates, etc.
● Example: In the sentence "Apple Inc. was founded by Steve Jobs in Cupertino," NER would identify:
○ Apple Inc. (Organization)
○ Steve Jobs (Person)
○ Cupertino (Location)
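A minimal sketch of NER on this sentence using spaCy and its small English model (both must be installed separately; the exact labels depend on the model).

```python
# A minimal sketch of NER with spaCy.
# Assumes: pip install spacy && python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple Inc. was founded by Steve Jobs in Cupertino.")
for ent in doc.ents:
    print(ent.text, "->", ent.label_)
# Expected (approximately): Apple Inc. -> ORG, Steve Jobs -> PERSON,
# Cupertino -> GPE (spaCy's label for geopolitical locations).
```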
d. Sentiment Analysis
● Definition: Sentiment analysis involves determining the emotional tone of a piece of text, such as whether it is positive, negative, or neutral.
● Example: In the sentence "I love this phone!", sentiment analysis would classify the sentiment as positive.
● Definition: Distributional semantics is based on the idea that words with similar meanings tend to appear in similar contexts. It uses word co-occurrence patterns from large text corpora to understand the meaning of words.
● Example: By examining how often words like "bank" appear in contexts involving "river," "fishing," and "water," the model can determine that "bank" likely refers to a riverbank in the sentence "He went to the bank to fish."
● Definition: Lexical resources are structured databases that store the meanings of words, their relationships, and various semantic properties. Examples include WordNet, a lexical database that categorizes words and their senses (synonyms, antonyms, hypernyms).
● Example: WordNet might link "bank" (as in a financial institution) with terms like "money" and "finance", while "bank" (as in a riverbank) might be linked to "river" and "shore."
● Definition: Deep learning techniques, particularly neural networks, are used to model complex semantic representations of words. Models like Word2Vec or BERT capture semantic relationships by learning vector embeddings of words based on large amounts of text data.
● Example: A deep learning model like Word2Vec would learn to represent the word "bank" as a vector that captures both senses of the word: one related to finance and the other related to riverbanks, based on the context in which the word is used.
● Definition: These models are used to assign semantic roles (like Agent, Theme, Goal, etc.) to different parts of a sentence. SRL models typically rely on machine learning algorithms and large annotated corpora.
● Example: In the sentence "The chef cooked dinner for his family," an SRL model would label:
○ Agent: The chef
○ Action: Cooked
○ Theme: Dinner
○ Goal: For his family
1. Hyponymy
Definition:
Hyponymy refers to a relationship between words where the meaning of one word (the hyponym) is more specific than that of the other (the hypernym or superclass). Essentially, a hyponym is a "type of" the hypernym.
Example:
● Hypernym: Animal
● Hyponyms: Dog, Cat, Elephant, Bird
Features:
● Hierarchical: Hyponymy represents a hierarchy (e.g., animal → dog → poodle).
● Asymmetrical: If A is a hyponym of B, then B is not a hyponym of A.
Applications in NLP:
● Ontology creation (e.g., WordNet)
● Taxonomic classification
● Question answering systems
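These hierarchical relations can be queried directly from WordNet via NLTK, as in the sketch below (assumes NLTK and its WordNet data are installed).

```python
# A minimal sketch of hyponymy/hypernymy lookups in WordNet.
# Assumes: pip install nltk and nltk.download('wordnet').
from nltk.corpus import wordnet as wn

dog = wn.synset("dog.n.01")
print("Hypernyms of dog:", [s.name() for s in dog.hypernyms()])
# e.g. ['canine.n.02', 'domestic_animal.n.01'] -- the "is a type of" parents.

animal = wn.synset("animal.n.01")
print("Some hyponyms of animal:", [s.name() for s in animal.hyponyms()][:5])
# More specific kinds of animal, illustrating the hierarchical relation.
```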
2. Homonymy
Homonymy refers to words that have the same spelling or pronunciation but different, unrelated meanings. Homonyms can be further divided into:
● Homophones: Words that sound the same but may differ in spelling and meaning. Example: flour (used in baking) vs. flower (part of a plant).
● Homographs: Words that are spelled the same but may differ in pronunciation and meaning. Example: lead (to guide) vs. lead (a metal).
Example:
Features:
● Meanings are unrelated.
● Can cause lexical ambiguity.
Applications in NLP:
3. Polysemy
Definition:
Polysemy occurs when a single word has multiple related meanings. Unlike homonyms, the meanings of polysemous words are connected through a shared origin or context.
Example:
● Bank:
1. A financial institution.
2. The side of a river.
Here, the meanings are related: the financial institution might have originated from "riverbanks" as early trading places.
Features:
● Meanings are related and share a conceptual base.
● A word can have multiple senses within different contexts.
Applications in NLP:
● Word Sense Disambiguation
● Semantic analysis and knowledge graphs
● Language translation
4. Synonymy
Definition:
Synonymy is the relationship between words that have the same or nearly the same meaning in a particular context. Synonyms are context-sensitive; they are not always interchangeable.
Example:
● Big and Large
● Happy and Joyful
Features:
Applications in NLP:
● Thesaurus generation (e.g., WordNet)
● Text summarization
● Paraphrase detection and generation
5. Antonymy
Definition:
Antonymy refers to words that have opposite meanings. Antonyms can be:
● Gradable Antonyms: Words that represent opposite ends of a spectrum (e.g., hot ↔ cold).
● Complementary Antonyms: Words where the presence of one implies the absence of the other (e.g., alive ↔ dead).
● Relational Antonyms: Words that describe opposite relationships (e.g., buy ↔ sell).
Example:
● Gradable: Tall ↔ Short
● Complementary: True ↔ False
● Relational: Teacher ↔ Student
Features:
● Antonyms highlight contrasts.
● The relationship is context-dependent.
Applications in NLP:
● Sentiment analysis (e.g., good vs. bad sentiment).
● Opposite meaning detection in semantic tasks.
4. WSD
5. Yarowsky
Advantages of the Yarowsky Algorithm
1. Semi-Supervised:
○ Requires only a small amount of labeled data, making it efficient for large-scale
tasks with minimal annotation.
2. Bootstrapping:
○ Automatically expands the training set by leveraging unlabeled data, reducing
manual annotation effort.
3. Effective Assumptions:
○ The "one sense per collocation" and "one sense per discourse" principles often
hold true in natural language, improving accuracy.
4. Adaptability:
○ Can be applied to various WSD tasks and other semi-supervised learning
problems.
Applications
● Word Sense Disambiguation: Resolving ambiguity in word meanings.
● Information Retrieval: Improving search results by understanding word senses.
● Machine Translation: Disambiguating words to choose the correct translation.
MODULE 5
MODULE 6
1. What is Machine Translation (MT)?
There are several approaches to machine translation, each with its own methodologies, strengths, and limitations. These include:
● Overview:
○ RBMT relies on linguistic rules and dictionaries to translate text.
○ It uses grammatical, syntactic, and semantic rules to map text from the source
language to the target language.
○ Requires extensive linguistic knowledge of both languages.
● Components:
○ Lexicon: Dictionary of words and their translations.
○ Syntactic Rules: Define sentence structures in both languages.
○ Semantic Rules: Ensure the meaning of the sentence is preserved.
● Strengths:
○ High interpretability of translations.
○ Works well for languages with rich linguistic resources.
○ Effective for domain-specific translation with carefully designed rules.
● Limitations:
○ Requires significant manual effort to build rules.
○ Limited scalability for new languages or domains.
○ Struggles with idiomatic expressions and complex sentence structures.
● Overview:
○ SMT is a data-driven approach that uses statistical models to generate
translations based on probabilities.
○ It relies on bilingual parallel corpora to learn mappings between source and
target languages.
● Key Concepts:
○ Translation Model: Captures probabilities of translating words or phrases from the source language to the target language.
○ Language Model: Ensures fluency by modeling the likelihood of word sequences in the target language.
○ Decoder: Combines the translation and language models to generate the most probable target sentence.
● Training Process:
○ Use a parallel corpus to extract word alignments and translation probabilities.
○ Train an n-gram-based language model on target language text.
○ Optimize model parameters to improve translation accuracy.
● Strengths:
○ Requires no explicit linguistic rules.
○ Adaptable to many languages with sufficient training data.
○ Transparent in terms of probabilities and alignments.
● Limitations:
○ Highly dependent on the availability and quality of bilingual corpora.
○ Struggles with rare words, idiomatic expressions, and long-range dependencies.
○ Outputs may lack fluency and coherence.
● Overview:
○ EBMT translates by reusing examples of translations stored in a database.
○ It relies on the principle that similar input sentences often translate similarly.
● Process:
○ Retrieve similar examples from a translation database.
○ Adapt the examples to create a translation for the input text.
○ Combine fragments of examples to handle complex inputs.
● Strengths:
○ Does not require linguistic rules or probabilistic models.
○ Useful for specific domains with repetitive patterns.
● Limitations:
○ Dependent on the quality and coverage of the example database.
○ May struggle with novel or diverse inputs.
● Overview:
○ NMT uses deep learning models to perform end-to-end translation.
○ It learns to map sentences from the source language to the target language
using large datasets.
● Key Features:
○ Encoder-Decoder Architecture:
■ The encoder processes the source sentence into a fixed-size vector
representation.
■ The decoder generates the target sentence from this representation.
○ Attention Mechanisms: Allow the model to focus on relevant parts of the source
sentence during translation.
○ Transformers: State-of-the-art architecture in NMT that uses self-attention mechanisms for better context understanding.
● Training:
○ Requires large amounts of parallel corpora for supervised training.
○ Optimization techniques like stochastic gradient descent (SGD) are used to
minimize translation errors.
● Strengths:
○ Produces more fluent and coherent translations than SMT.
○ Handles long-range dependencies and context better.
○ Can generalize well to unseen phrases with sufficient training data.
● Limitations:
○ Requires substantial computational resources.
○ Dependent on the quality and quantity of training data.
○ May struggle with low-resource languages.
● Overview:
○ Combines elements of RBMT, SMT, and NMT to leverage their respective
strengths.
○ For example, a system may use rules for grammatical structure and SMT or NMT
for lexical choice.
● Strengths:
○ Provides a balance between rule-based precision and statistical flexibility.
○ Can perform well in low-resource settings by using linguistic rules to supplement
sparse data.
● Limitations:
○ Integration of different approaches can be complex.
○ Requires careful design to avoid conflicts between components.
● Tokenization: Splits the input text into smaller units like words, subwords, or characters.
● Lowercasing and Normalization: Converts all text to lowercase and standardizes text formats (e.g., removing diacritics, handling contractions).
● Language-Specific Processing: Includes stemming, lemmatization, or handling language-specific grammar rules, if necessary.
● Handling Out-of-Vocabulary Words: For unknown words, subword tokenization (e.g., Byte Pair Encoding) or placeholders might be used.
The system analyzes the grammatical and semantic structure of the source sentence:
● Syntax Analysis: Identifies the sentence structure (e.g., subject, verb, object).
● Morphological Analysis: Understands word forms and their grammatical functions (e.g., verb tense, pluralization).
● Semantic Analysis: Ensures the meaning of the input sentence is captured accurately.
1. Encoding:
○ The source sentence is converted into a continuous vector representation by the
encoder(a neural network like RNN, LSTM, or Transformer).
○ This representation captures the semantic and syntactic information of the
sentence.
2. Attention Mechanism:
○ Helps the system focus on relevant parts of the source sentence while
generating each word in the target sentence.
3. Decoding:
○ The decoder generates the target sentence word by word based on the encoded representation and attention context.
4. Post-Processing:
○ The output is detokenized, capitalized, or otherwise normalized into readable
text.
4. Reordering
● English: "I read a book."
● Japanese: "Watashi wa hon o yomimasu." (Literal: "I a book read.") The system must adjust the word order in the target language to preserve meaning and grammatical correctness.
5. Output Postprocessing
● Detokenization: Combines tokens into complete words and sentences.
● Grammatical Adjustment: Corrects minor errors in verb tense, agreement, or punctuation.
● Handling Unknown Words: Attempts to infer or leave placeholders for untranslated terms.
Input Sentence: "The weather is nice today."
1. Preprocessing:
○ Tokenization: ["The", "weather", "is", "nice", "today"]
○ Normalization: ["the", "weather", "is", "nice", "today"]
2. Encoding:
○ Sentence encoded as a numerical vector: [0.1, 0.5, ..., 0.9]
3. Attention Mechanism:
○ Focuses on relevant words (e.g., "weather" aligns with "météo").
4. Decoding:
○ Word-by-word generation: "Le", "temps", "est", "agréable", "aujourd'hui."
5. Postprocessing:
○ Detokenization: "Le temps est agréable aujourd'hui."
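A minimal sketch of running such an NMT system in practice, using the Hugging Face transformers pipeline with a public MarianMT English-French model (the model name is an assumption about availability; weights are downloaded on first use).

```python
# A minimal sketch of end-to-end NMT with a pretrained translation model.
# Assumes: pip install transformers sentencepiece torch
from transformers import pipeline

translator = pipeline("translation", model="Helsinki-NLP/opus-mt-en-fr")
result = translator("The weather is nice today.")
print(result[0]["translation_text"])   # e.g. "Le temps est agréable aujourd'hui."
```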
Challenges in MT Systems
Definition:
Information Retrieval focuses on finding and retrieving relevant documents or pieces of
information from large collections of unstructured or semi-structured data (like text, images, or
videos). It does not alter the content but ranks and presents the most relevant items to a user
query.
Key Characteristics:
● Operates at the document level.
● Concerned with finding relevant data, not with understanding or extracting specific details from it.
● Outputs a ranked list of documents or passages related to the user's query.
Typical Workflow:
1. Indexing:
○ Convert the raw text into a searchable structure (e.g., inverted index).
○ Tokenize text and store term-document relationships for fast lookup.
2. Query Processing:
○ Analyze the user’s query to understand the search intent.
○ Tokenize and normalize the query (e.g., stemming, stopword removal).
3. Matching:
○ Compare the query against the indexed documents using various similarity measures (e.g., cosine similarity, TF-IDF).
4. Ranking:
○ Rank documents based on relevance to the query using scoring algorithms (e.g.,
BM25).
5. Result Presentation:
○ Return the most relevant results to the user, often with snippets.
Examples:
● Search engines like Google or Bing.
● Library catalogs or document repositories.
● Question-answering systems returning relevant documents.
Key Techniques:
● Boolean Retrieval: Uses Boolean logic (AND, OR, NOT) for matching queries to documents.
● Vector Space Models: Represent documents and queries as vectors and calculate their similarity.
● Probabilistic Models: Rank documents based on the probability of relevance to the query (e.g., BM25).
● Latent Semantic Analysis (LSA): Captures relationships between terms and concepts for better matching.
● Deep Learning in IR: Neural approaches like BERT-based models improve query understanding and ranking.
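A minimal sketch of the vector space model in code: TF-IDF vectors plus cosine similarity ranking a tiny invented document collection against a query, using scikit-learn.

```python
# A minimal sketch of vector-space retrieval: TF-IDF + cosine similarity.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "Machine translation converts text between languages.",
    "Information retrieval ranks documents for a user query.",
    "The dog chased the cat in the garden.",
]
query = "ranking documents for search queries"

vec = TfidfVectorizer(stop_words="english")
doc_matrix = vec.fit_transform(docs)             # index the collection
query_vec = vec.transform([query])               # process the query

scores = cosine_similarity(query_vec, doc_matrix)[0]
for score, doc in sorted(zip(scores, docs), reverse=True):
    print(f"{score:.3f}  {doc}")                 # ranked result list
```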
Definition:
Information Extraction is the process of identifying and extracting structured, meaningful data
from unstructured text. It focuses on understanding the content and deriving specific entities,
relationships, and facts.
Key Characteristics:
● Operates at a finer granularity, often extracting specific pieces of data (e.g., names, dates, or relationships).
● Aims to transform unstructured data into structured formats (e.g., database entries,
knowledge graphs).
● Outputs structured information like tables, JSON objects, or semantic triples.
Typical Workflow:
1. Preprocessing:
○ Clean and normalize the text (e.g., tokenization, POS tagging).
○ Remove noise like irrelevant characters or stopwords.
2. Entity Recognition:
○ Identify named entities (e.g., people, locations, organizations) using Named
Entity Recognition (NER).
3. Relation Extraction:
○ Identify relationships between entities (e.g., "John works at Microsoft").
4. Event Extraction:
○ Identify events and their attributes (e.g., "A meeting was held on [date] between
[person1] and [person2]").
5. Template Filling:
○ Populate pre-defined templates or databases with extracted information.
6. Output:
○ Present the extracted data in structured formats like CSV, JSON, or RDF.
Examples:
● Extracting contact details (email, phone) from resumes.
● Summarizing and extracting financial data from annual reports.
● Populating knowledge graphs with facts from Wikipedia.
Key Techniques:
● Named Entity Recognition (NER): Identifies entities like names, dates, and locations.
● Dependency Parsing: Analyzes sentence structure to extract relationships.
● Rule-Based Systems: Uses regular expressions and templates for extraction.
● Machine Learning-Based IE:
○ Supervised models trained on labeled data.
○ Techniques like Conditional Random Fields (CRFs) for sequential data extraction.
● Deep Learning for IE:
○ Transformer-based models like BERT fine-tuned for entity and relation extraction tasks.
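As a small illustration of the rule-based approach, the sketch below uses regular expressions to pull contact details out of an invented resume-style snippet; the patterns are deliberately simplified.

```python
# A minimal sketch of rule-based information extraction with regular expressions.
import re

text = "Contact Jane Doe at jane.doe@example.com or +1-555-123-4567."

EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")   # simplified email pattern
PHONE = re.compile(r"\+?\d[\d\s().-]{7,}\d")     # simplified phone pattern

record = {
    "emails": EMAIL.findall(text),
    "phones": PHONE.findall(text),
}
print(record)   # structured output, e.g. {'emails': [...], 'phones': [...]}
```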
Key Differences Between IR and IE
Aspect: Techniques
● Information Retrieval (IR): TF-IDF, BM25, BERT for query ranking.
● Information Extraction (IE): NER, relation extraction, dependency parsing.
Applications of Information Retrieval (IR):
1. Web Search: Search engines like Google retrieve relevant web pages.
2. Document Search: Finding relevant research papers, articles, or legal documents.
3. Enterprise Search: Searching internal organizational repositories.
4. Question-Answering Systems: Retrieving relevant documents for answering user queries.
Applications of Information Extraction (IE):
1. Knowledge Graph Construction: Extracting facts to populate knowledge bases like Google Knowledge Graph.
2. Summarization: Extracting key entities and events for text summarization.
3. Business Intelligence: Extracting trends, financial figures, or insights from reports.
4. Social Media Analysis: Identifying sentiments, trends, or events from social media posts.