NLP IA1 Question Bank

1. Good-Turing hypothesis

Good-Turing smoothing is a statistical technique used in Natural Language Processing (NLP) to
improve the estimation of the probabilities of unseen events in a dataset. It’s particularly useful in
language modelling, where the occurrence of some words or n-grams might not be observed in the
training data, but we still want to assign them a non-zero probability.
Concept
The basic idea of Good-Turing smoothing is to re-estimate the probability of observed events and
reserve some probability mass for events that have not been observed (unseen events). It does this by
adjusting the counts of observed events based on how many times events with similar counts occur.
Steps Involved:
• Count Frequencies: First, determine how many times each event (e.g., word or n-gram) appears in your
dataset.
• Frequency of Frequencies: Next, calculate the frequency of frequencies, i.e., how many events have
appeared once, twice, three times, etc. This gives you a sense of how common different count values are.
• Adjusted Count Calculation: For an event that appears c times, Good-Turing smoothing estimates
the adjusted count c* as c* = (c + 1) · N(c+1) / N(c), where N(c) is the number of distinct events that
occur exactly c times. The total probability mass reserved for unseen events is N(1) / N, where N is the
total number of observed event tokens.
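A minimal sketch of the adjusted-count calculation in Python (the toy corpus and function name are illustrative assumptions, not part of the original question):

from collections import Counter

def good_turing_adjusted_counts(counts):
    # N(c): how many distinct events occur exactly c times ("frequency of frequencies")
    freq_of_freq = Counter(counts.values())
    adjusted = {}
    for event, c in counts.items():
        n_c = freq_of_freq[c]
        n_c_plus_1 = freq_of_freq.get(c + 1, 0)
        # c* = (c + 1) * N(c+1) / N(c); fall back to the raw count if N(c+1) is zero
        adjusted[event] = (c + 1) * n_c_plus_1 / n_c if n_c_plus_1 > 0 else c
    return adjusted

word_counts = Counter("the cat sat on the mat the cat ran".split())
total_tokens = sum(word_counts.values())
print(good_turing_adjusted_counts(word_counts))
# Probability mass reserved for unseen words: N(1) / N
print(sum(1 for c in word_counts.values() if c == 1) / total_tokens)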
1. Perplexity
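Perplexity measures how well a language model predicts a held-out test sequence W = w1 w2 ... wN. It is commonly defined as the inverse probability of the test set normalized by the number of words, PP(W) = P(w1 w2 ... wN)^(-1/N), which is equivalent to exp(-(1/N) Σ log P(wi | w1 ... wi-1)).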

Interpretation:
• Lower Perplexity: Indicates a better model, as it suggests the model is more confident in its predictions.
• Higher Perplexity: Indicates a worse model, as it suggests the model is more uncertain and less accurate
in its predictions.
Applications:
• Language Model Evaluation: Perplexity is used to compare different language models to determine
which one better predicts the test data.
• Model Tuning: Helps in adjusting hyperparameters and selecting the best model architecture during
training.
Limitations:
• Comparability: Perplexity values are only comparable between models that use the same vocabulary and
test set.
• Sensitivity to Model Size: Larger models might overfit, leading to deceptively low perplexity on the test
set but poor generalization.
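A minimal sketch of computing perplexity from per-token conditional probabilities (the probability values below are made-up numbers for illustration):

import math

def perplexity(token_probs):
    # Perplexity = exp of the average negative log-probability per token
    n = len(token_probs)
    return math.exp(-sum(math.log(p) for p in token_probs) / n)

# Model A assigns higher probabilities to the test tokens than model B,
# so it gets the lower (better) perplexity.
model_a = [0.2, 0.1, 0.25, 0.15]   # hypothetical P(w_i | context) values
model_b = [0.05, 0.02, 0.1, 0.04]
print(perplexity(model_a))   # roughly 6.0
print(perplexity(model_b))   # roughly 22.4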

2. Challenges in perplexity
Vocabulary Size Impact:
• Challenge: Perplexity is sensitive to the size of the vocabulary. Larger vocabularies can lead to higher
perplexity because there are more potential outcomes to consider, which makes direct comparisons
between models with different vocabularies difficult.
Overfitting:
• Challenge: Models with a large number of parameters may achieve low perplexity on the test set by
overfitting to the data, which can result in poor generalization to unseen data.
Interpretation Across Domains:
• Challenge: Perplexity is not always intuitive across different languages or domains. A model that works
well in one language or type of text (e.g., technical writing) may have different perplexity outcomes in
another, making cross-domain comparisons tricky.
Limited Scope:
• Challenge: Perplexity only measures how well a model predicts a sequence of words, not how well the
model understands or generates meaningful, coherent, and contextually appropriate sentences. Thus, it
doesn't fully capture the quality of the language model.
Dependency on Test Set:
• Challenge: The perplexity of a model is heavily dependent on the specific test set used. If the test set is
not representative of the general language distribution, the perplexity score may not accurately reflect the
model's performance in real-world scenarios.

3. Stages in NLP

Traditionally, NLP for both spoken and written language has been
regarded as consisting of the following stages:
1. Phonology and Phonetics (Sound Processing): Homophones are words that sound the same but have
different meanings, regardless of spelling. Examples include mean (average) vs. mean (not nice) and week
vs. weak. Rapid speech can cause word boundary detection challenges, as seen in the phrase "a part" (a
piece of something) vs. "apart" (separated).
2. Morphology (Word Forms): Ambiguity arises when root words are modified. For example, "make me
nuts" could mean either preparing a dish or causing annoyance.
3. Lexical Analysis: This involves breaking down text into paragraphs, sentences, and words to analyze
structure and meaning.
4. Syntactic Analysis (Parsing): Words in a sentence are analyzed for grammar and relationships. For
instance, "The school goes to boy" would be rejected as it is grammatically incorrect.
5. Semantic Analysis: This step extracts the exact meaning of words, ensuring the text is meaningful, such
as disregarding the phrase "hot ice-cream."
6. Discourse Integration: The meaning of a sentence is influenced by the sentences before and after it.
7. Pragmatic Analysis: This involves interpreting language based on real-world knowledge to understand
the intended meaning.

3. Ambiguity at each level


Natural language has an extremely rich form and structure, which makes it highly ambiguous. Ambiguity can occur at different levels:
1. Lexical ambiguity –
Words that sound the same but have different meanings
• Red and Read
• Flower and Flour
• I and Eye
• Write and Right
2. Syntax-level ambiguity –
A sentence can be parsed in different ways (structural ambiguity).
For example, “Old men and women were taken to a safe place” (does “old” modify only “men”, or both “men” and “women”?)

3. Semantic ambiguity –
Once the words have been identified and the sentence structure has been detected, sentence processing turns to
meaning extraction. Semantic ambiguity occurs when the meaning of the words themselves can be misinterpreted:
even after the syntax and the meanings of the individual words have been resolved, there may be more than one
way of reading the sentence, as shown in the example below:
“Seema loves her mother and Sriya does too.” (The interpretations can be that Sriya loves Seema’s
mother or that Sriya loves her own mother.)
4. Pragmatic ambiguity –
Pragmatic ambiguity (multiple interpretations): pragmatic analysis is the part of information extraction
that takes a structured set of text and works out what the actual intended meaning was. Pragmatic ambiguity
refers to a situation where the context of a phrase gives it multiple interpretations.
Example:
Tourist (checking out of the hotel): Waiter, go upstairs to my room and see if my sandals are there;
do not be late; I have to catch the train in 15 minutes. Waiter (running upstairs and coming back
panting): Yes sir, they are there. Clearly, the waiter falls short of the tourist’s expectation,
since he does not understand the pragmatics of the situation.[3]

5. Referential ambiguity − Ambiguity about what a pronoun refers to. For example: Rima went to Gauri. She said,
“I am tired.” Exactly who is tired? One input can have different meanings, and many different inputs can mean the
same thing.

4. Stemming and Lemmatization

Stemming: Stemming reduces words to their base form, which may not be a dictionary root. For example,
"connections," "connected," and "connects" stem to "connect," while "trouble," "troubled," and "troubling" stem
to "troubl," even though "troubl" isn’t a recognized word. Stemming is crucial for normalizing text in building
models.
Applications: Stemming is used in information retrieval, text mining, SEO, web search, indexing, and word
analysis. For example, a Google search for "prediction" and "predicted" yields similar results. Examples for the
root word "like" include "likes," "liked," and "likely."

Over-stemming: Over-stemming occurs when words with different meanings are reduced to the same stem, so
unrelated words are wrongly conflated.

Under-stemming: Under-stemming is the opposite of over-stemming: words with the same meaning (different forms
of the same word) are reduced to different stems, so related words are not grouped together.
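A minimal sketch contrasting stemming and lemmatization with NLTK (this assumes NLTK and its WordNet data are installed; the word list is illustrative):

from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

for word in ["connections", "connected", "troubling", "studies"]:
    # Stemming can yield non-dictionary strings (e.g. "troubl");
    # lemmatization returns dictionary forms but depends on the POS tag (default: noun).
    print(word, "stem:", stemmer.stem(word), "lemma:", lemmatizer.lemmatize(word))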

5. Porter stemmer algorithm

The Porter stemming algorithm (or ‘Porter stemmer’) is a process for removing the commoner
morphological and inflexional endings from words in English. Its main use is as part of a term
normalisation process that is usually done when setting up Information Retrieval systems.
• Popular for information retrieval and text categorization
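A minimal sketch of the rule-based, suffix-stripping idea behind the algorithm, showing only Step 1a-style rules (the real algorithm has several ordered steps with conditions on the stem; this is not a full implementation):

import re

# A few Porter Step 1a rewrite rules, tried in order; the first match wins.
RULES = [
    (r"sses$", "ss"),  # caresses -> caress
    (r"ies$", "i"),    # ponies   -> poni
    (r"ss$", "ss"),    # caress   -> caress (unchanged)
    (r"s$", ""),       # cats     -> cat
]

def step1a(word):
    for pattern, replacement in RULES:
        if re.search(pattern, word):
            return re.sub(pattern, replacement, word)
    return word

for w in ["caresses", "ponies", "caress", "cats"]:
    print(w, "->", step1a(w))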
4. Applications of NLP

• Chatbots and Virtual Assistants: Enhance user interactions by providing automated responses and
support based on natural language input. They streamline customer service and improve accessibility.
• Sentiment Analysis: Analyze social media, reviews, or customer feedback to gauge the overall sentiment
towards products or services. This helps businesses tailor their strategies and improve customer satisfaction.
• Machine Translation: Automatically translate text between different languages, facilitating cross-
language communication. It bridges gaps in international collaboration and content consumption.
• Speech Recognition: Convert spoken language into text, enabling hands-free control and transcription
services. It enhances accessibility for users with disabilities and improves efficiency in transcription tasks.
• Text Classification: Categorize text into predefined categories, useful for spam detection or content
moderation. This helps in organizing large volumes of information and maintaining content quality.
• Named Entity Recognition (NER): Identify and classify entities such as people, organizations, and
locations in text. It aids in organizing information and extracting meaningful insights from documents.
• Information Extraction: Automatically extract relevant data from large volumes of text, useful for
summarizing reports or research papers. It accelerates data analysis and decision-making processes.
• Summarization: Condense long articles or documents into brief summaries while retaining key
information. This makes it easier to grasp essential content quickly and efficiently.
• Question Answering: Provide accurate answers to user queries based on a given text or knowledge base.
It enhances user experience by delivering precise and contextually relevant information.
• Language Generation: Create human-like text for applications such as content creation, dialogue
systems, and creative writing. It facilitates automated content production and interactive storytelling.

5. N gram model
The n-gram model in NLP is a technique for predicting the next word in a sequence based on the previous
n−1 words. It’s widely used for various NLP tasks, including text generation and speech
recognition. Here’s a breakdown:
1. Definition: An n-gram is a contiguous sequence of n items from a given sample of text or speech. These
items can be words, characters, or other tokens.
2. Unigram (1-gram): Considers single words independently. For example, given a text corpus, it calculates
the frequency of each individual word.

3. Bigram (2-gram): Considers pairs of consecutive words. For instance, it estimates the probability of a
word based on the word that precedes it. If the sentence is "The cat sat," the bigrams would be "The cat" and "cat sat."
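A minimal sketch of extracting and counting n-grams (the sample sentence is illustrative):

from collections import Counter

def ngrams(tokens, n):
    # All contiguous n-grams of a token list, as tuples
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "the cat sat on the mat".split()
unigram_counts = Counter(ngrams(tokens, 1))
bigram_counts = Counter(ngrams(tokens, 2))
print(bigram_counts)   # e.g. ('the', 'cat'): 1, ('cat', 'sat'): 1, ...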

6. MLE
Maximum Likelihood Estimation (MLE) in NLP is a statistical method used to estimate the parameters of
a probabilistic model based on observed data. It aims to find the parameters that maximize the likelihood
of the observed data under the model. Here’s how MLE is applied in NLP:
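For n-gram language models, the MLE of a bigram probability is simply the relative frequency P(wi | wi-1) = C(wi-1 wi) / C(wi-1). A minimal sketch (the toy corpus and function name are assumptions for illustration):

from collections import Counter

corpus = "the cat sat on the mat the cat ran".split()
bigram_counts = Counter(zip(corpus, corpus[1:]))
unigram_counts = Counter(corpus)

def mle_bigram(prev, word):
    # Count of the bigram divided by count of the preceding word
    return bigram_counts[(prev, word)] / unigram_counts[prev]

print(mle_bigram("the", "cat"))   # C(the cat) = 2, C(the) = 3 -> about 0.67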
7. Examples on N gram
8. Role of FSA in morphological analysis

• Definition: Finite State Automata are computational models used to represent and manipulate regular
languages. They consist of states, transitions between states, and acceptance conditions for recognizing
sequences of symbols (strings).

• Morphological Analysis: In morphological analysis, FSAs are employed to model and analyze the structure
of words, including their prefixes, roots, suffixes, and inflections. They help in understanding and processing
the internal structure of words.

• Lexical Representation: FSAs can represent morphological rules and patterns. For instance, they can encode
the rules for verb conjugation or noun declension in a language. By defining states and transitions based on
morphological rules, FSAs can parse and generate word forms.

• Tokenization: FSAs are used to tokenize text by identifying and separating meaningful units, such as words
or morphemes, based on predefined patterns. This is crucial for breaking down complex words into their
constituent morphemes.

• Stemming and Lemmatization: FSAs can be used in stemming (reducing words to their root forms) and
lemmatization (finding the base or dictionary form of a word). They facilitate the transformation of word forms
into their canonical forms by processing morphological variations.

• Morphological Generation: FSAs enable the generation of word forms based on morphological rules. For
example, they can produce all possible inflections of a word given a set of grammatical rules.

• Efficiency: FSAs are efficient in processing and analyzing morphological structures due to their simple and
deterministic nature. They provide a compact and efficient way to handle regular morphological patterns.
• Integration with Other Models: FSAs can be combined with other models, such as probabilistic models or
neural networks, to enhance morphological analysis. They can provide foundational rules and patterns that more
complex models can build upon.
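A minimal sketch of a deterministic FSA over morphemes that accepts a toy pattern of English noun inflection, a root plus an optional plural suffix (the states, transitions, and word lists are illustrative assumptions):

# States: "start" -> "root" (after a known noun root) -> "plural" (after a plural suffix).
ROOTS = {"cat", "dog", "fox"}
TRANSITIONS = {
    ("start", "root"): "root",
    ("root", "s"): "plural",
    ("root", "es"): "plural",
}
ACCEPTING = {"root", "plural"}

def accepts(morphemes):
    # Run the automaton over a list of morphemes, e.g. ["fox", "es"]
    state = "start"
    for m in morphemes:
        symbol = "root" if m in ROOTS else m
        state = TRANSITIONS.get((state, symbol))
        if state is None:
            return False
    return state in ACCEPTING

print(accepts(["cat"]))        # True
print(accepts(["fox", "es"]))  # True
print(accepts(["es", "cat"]))  # False (suffix before a root)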

9. Design FSA
10. Affixes and types of affixes

• Prefix: Added to the beginning of a base word to alter its meaning. For example, in "unhappy," "un-" is a
prefix that negates the meaning of "happy."

• Suffix: Added to the end of a base word to modify its meaning or function. For example, in "happiness",
"-ness" is a suffix that turns the adjective "happy" into a noun.

• Infix: Inserted within a base word, though this is less common in English. For example, in some languages
like Tagalog, infixes can be used to alter the meaning of a root word. In English, infixation is rare but can be
seen in informal language, like "un-freaking-believable."

• Circumfix: Surrounds the base word with both a prefix and a suffix. This type is also more common in
languages other than English. For example, in German, the circumfix "ge-...-t" can be used to form past
participles (e.g., "geliebt" from "lieben").

11. Laplace smoothing


Laplace smoothing, also known as additive smoothing, is a technique used in NLP to handle the problem
of zero probabilities in probabilistic models, especially in language modeling. Here’s how it works and
why it’s important:
Purpose
• Zero Probability Problem: In probabilistic models like n-gram models, some word sequences might not
appear in the training data. This results in zero probabilities for these unseen sequences, which can affect
the performance of the model.
• Improving Probability Estimates: Laplace smoothing adjusts the probabilities to ensure that no event
(such as a word or sequence of words) has zero probability, even if it has not been observed in the training
data.
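With add-one (Laplace) smoothing, the bigram estimate becomes P(wi | wi-1) = (C(wi-1 wi) + 1) / (C(wi-1) + V), where V is the vocabulary size. A minimal sketch (the toy corpus is an illustrative assumption):

from collections import Counter

corpus = "the cat sat on the mat the cat ran".split()
bigram_counts = Counter(zip(corpus, corpus[1:]))
unigram_counts = Counter(corpus)
V = len(unigram_counts)   # vocabulary size

def laplace_bigram(prev, word):
    # Add-one smoothing: every bigram, seen or unseen, gets a non-zero probability
    return (bigram_counts[(prev, word)] + 1) / (unigram_counts[prev] + V)

print(laplace_bigram("the", "cat"))   # seen bigram
print(laplace_bigram("the", "sofa"))  # unseen bigram, but still > 0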

12. Inflectional and derivational morphology

13. Role of Regular expression

Tokenization: Regex helps in splitting text into tokens, such as words or sentences. For example, a regex
pattern can identify punctuation marks and whitespace to separate text into individual words or sentences.
Text Normalization: Regex is used to normalize text by removing unwanted characters, standardizing
formats, or correcting inconsistencies. For instance, regex can strip out non-alphanumeric characters or
standardize dates.
Pattern Matching: It allows for identifying and extracting specific patterns from text. This includes
detecting email addresses, phone numbers, URLs, or specific keywords within a corpus.
Text Extraction: Regex can extract relevant information from text based on patterns. For example, it can
pull out product codes, dates, or any structured data embedded in unstructured text.
Data Cleaning: In NLP, regex is used to clean and preprocess text data by removing noise, correcting
formatting issues, or handling misspellings and irregularities.
Information Retrieval: Regex helps in refining search queries and matching relevant documents. For
instance, it can be used to filter search results or extract specific sections from documents based on pattern
criteria.
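A minimal sketch of a few of these uses with Python's re module (the patterns and the sample text are illustrative, not production-grade):

import re

text = "Contact us at support@example.com or call 555-0100. Visit https://example.com today!"

# Tokenization: pull out runs of word characters
tokens = re.findall(r"\w+", text)

# Pattern matching / extraction: e-mail addresses and URLs
emails = re.findall(r"[\w.+-]+@[\w-]+\.[\w.]+", text)
urls = re.findall(r"https?://[\w./-]+", text)

# Normalization / cleaning: lowercase and strip non-alphanumeric characters
cleaned = re.sub(r"[^a-z0-9\s]", "", text.lower())

print(tokens[:4], emails, urls)
print(cleaned)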
