NLP IA1 Question Bank

1. Good-Turing hypothesis

Good-Turing smoothing is a statistical technique used in Natural Language Processing (NLP) to
improve the estimation of the probabilities of unseen events in a dataset. It’s particularly useful in
language modelling, where the occurrence of some words or n-grams might not be observed in the
training data, but we still want to assign them a non-zero probability.
Concept
The basic idea of Good-Turing smoothing is to re-estimate the probability of observed events and
reserve some probability mass for events that have not been observed (unseen events). It does this by
adjusting the counts of observed events based on how many times events with similar counts occur.
Steps Involved:
• Count Frequencies: First, determine how many times each event (e.g., word or n-gram) appears in your
dataset.
• Frequency of Frequencies: Next, calculate the frequency of frequencies, i.e., how many events have
appeared once, twice, three times, etc. This gives you a sense of how common different count values are.
• Adjusted Count Calculation: For an event that appears c times, Good-Turing smoothing estimates
the adjusted count c* as c* = (c + 1) · N(c+1) / N(c), where N(c) is the number of distinct events that
occur exactly c times. The total probability mass reserved for unseen events is N(1) / N, where N is the
total number of observed event tokens.
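A minimal sketch of the adjusted-count calculation in Python (the toy corpus and function name are illustrative assumptions, not part of the original question):

from collections import Counter

def good_turing_adjusted_counts(counts):
    # N(c): how many distinct events occur exactly c times ("frequency of frequencies")
    freq_of_freq = Counter(counts.values())
    adjusted = {}
    for event, c in counts.items():
        n_c = freq_of_freq[c]
        n_c_plus_1 = freq_of_freq.get(c + 1, 0)
        # c* = (c + 1) * N(c+1) / N(c); fall back to the raw count if N(c+1) is zero
        adjusted[event] = (c + 1) * n_c_plus_1 / n_c if n_c_plus_1 > 0 else c
    return adjusted

word_counts = Counter("the cat sat on the mat the cat ran".split())
total_tokens = sum(word_counts.values())
print(good_turing_adjusted_counts(word_counts))
# Probability mass reserved for unseen words: N(1) / N
print(sum(1 for c in word_counts.values() if c == 1) / total_tokens)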
1. Perplexity
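Perplexity measures how well a language model predicts a held-out test sequence W = w1 w2 ... wN. It is commonly defined as the inverse probability of the test set normalized by the number of words, PP(W) = P(w1 w2 ... wN)^(-1/N), which is equivalent to exp(-(1/N) Σ log P(wi | w1 ... wi-1)).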

Interpretation:
• Lower Perplexity: Indicates a better model, as it suggests the model is more confident in its predictions.
• Higher Perplexity: Indicates a worse model, as it suggests the model is more uncertain and less accurate
in its predictions.
Applications:
• Language Model Evaluation: Perplexity is used to compare different language models to determine
which one better predicts the test data.
• Model Tuning: Helps in adjusting hyperparameters and selecting the best model architecture during
training.
Limitations:
• Comparability: Perplexity values are only comparable between models that use the same vocabulary and
test set.
• Sensitivity to Model Size: Larger models might overfit, leading to deceptively low perplexity on the test
set but poor generalization.
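A minimal sketch of computing perplexity from per-token conditional probabilities (the probability values below are made-up numbers for illustration):

import math

def perplexity(token_probs):
    # Perplexity = exp of the average negative log-probability per token
    n = len(token_probs)
    return math.exp(-sum(math.log(p) for p in token_probs) / n)

# Model A assigns higher probabilities to the test tokens than model B,
# so it gets the lower (better) perplexity.
model_a = [0.2, 0.1, 0.25, 0.15]   # hypothetical P(w_i | context) values
model_b = [0.05, 0.02, 0.1, 0.04]
print(perplexity(model_a))   # roughly 6.0
print(perplexity(model_b))   # roughly 22.4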

2. Challenges in perplexity
Vocabulary Size Impact:
• Challenge: Perplexity is sensitive to the size of the vocabulary. Larger vocabularies can lead to higher
perplexity because there are more potential outcomes to consider, which makes direct comparisons
between models with different vocabularies difficult.
Overfitting:
• Challenge: Models with a large number of parameters may achieve low perplexity on the test set by
overfitting to the data, which can result in poor generalization to unseen data.
Interpretation Across Domains:
• Challenge: Perplexity is not always intuitive across different languages or domains. A model that works
well in one language or type of text (e.g., technical writing) may have different perplexity outcomes in
another, making cross-domain comparisons tricky.
Limited Scope:
• Challenge: Perplexity only measures how well a model predicts a sequence of words, not how well the
model understands or generates meaningful, coherent, and contextually appropriate sentences. Thus, it
doesn't fully capture the quality of the language model.
Dependency on Test Set:
• Challenge: The perplexity of a model is heavily dependent on the specific test set used. If the test set is
not representative of the general language distribution, the perplexity score may not accurately reflect the
model's performance in real-world scenarios.

3. Stages in NLP

Traditionally, NLP for both spoken and written language has been
regarded as consisting of the following stages:
1. Phonology and Phonetics (Sound Processing): Homophones are words that sound the same but have
different meanings, regardless of spelling. Examples include mean (average) vs. mean (not nice) and week
vs. weak. Rapid speech can cause word boundary detection challenges, as seen in the phrase "a part" (a
piece of something) vs. "apart" (separated).
2. Morphology (Word Forms): Ambiguity arises when root words are modified. For example, "make me
nuts" could mean either preparing a dish or causing annoyance.
3. Lexical Analysis: This involves breaking down text into paragraphs, sentences, and words to analyze
structure and meaning.
4. Syntactic Analysis (Parsing): Words in a sentence are analyzed for grammar and relationships. For
instance, "The school goes to boy" would be rejected as it is grammatically incorrect.
5. Semantic Analysis: This step extracts the exact meaning of words, ensuring the text is meaningful, such
as disregarding the phrase "hot ice-cream."
6. Discourse Integration: The meaning of a sentence is influenced by the sentences before and after it.
7. Pragmatic Analysis: This involves interpreting language based on real-world knowledge to understand
the intended meaning.

3. Ambiguity at each level


Natural language has an extremely rich form and structure, which makes it highly ambiguous. Ambiguity can occur at different levels:
1. Lexical ambiguity –
Words that sound the same but have different meanings
• Red and Read
• Flower and Flour
• I and Eye
• Write and Right
2. Syntax-level ambiguity –
A sentence can be parsed in different ways (structural ambiguity).
For example, “Old men and women were taken to a safe place” (does “old” modify only “men”, or both “men” and “women”?)

3. Semantic ambiguity –
Once the words have been identified and the sentence structure has been detected, sentence processing turns to
meaning extraction. Semantic ambiguity occurs when the meaning of the words themselves can be misinterpreted:
even after the syntax and the meanings of the individual words have been resolved, there may be more than one
way of reading the sentence, as shown in the example below:
“Seema loves her mother and Sriya does too.” (The interpretations can be that Sriya loves Seema’s
mother or that Sriya loves her own mother.)
4. Pragmatic ambiguity –
Pragmatic ambiguity (multiple interpretations): pragmatic analysis is the part of information extraction
that takes a structured set of text and works out what the actual intended meaning was. Pragmatic ambiguity
refers to a situation where the context of a phrase gives it multiple interpretations.
Example:
Tourist (checking out of the hotel): Waiter, go upstairs to my room and see if my sandals are there;
do not be late; I have to catch the train in 15 minutes. Waiter (running upstairs and coming back
panting): Yes sir, they are there. Clearly, the waiter falls short of the tourist’s expectation,
since he does not understand the pragmatics of the situation.[3]

5. Referential ambiguity − Ambiguity about what a pronoun refers to. For example: Rima went to Gauri. She said,
“I am tired.” Exactly who is tired? One input can have different meanings, and many different inputs can mean the
same thing.

4. Stemming and Lemmatization

Stemming: Stemming reduces words to their base form, which may not be a dictionary root. For example,
"connections," "connected," and "connects" stem to "connect," while "trouble," "troubled," and "troubling" stem
to "troubl," even though "troubl" isn’t a recognized word. Stemming is crucial for normalizing text in building
models.
Applications: Stemming is used in information retrieval, text mining, SEO, web search, indexing, and word
analysis. For example, a Google search for "prediction" and "predicted" yields similar results. Examples for the
root word "like" include "likes," "liked," and "likely."

Over-stemming: Over-stemming occurs when words with different meanings are reduced to the same stem, so
unrelated words are wrongly conflated.

Under-stemming: Under-stemming is the opposite of over-stemming: words with the same meaning (different forms
of the same word) are reduced to different stems, so related words are not grouped together.
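A minimal sketch contrasting stemming and lemmatization with NLTK (this assumes NLTK and its WordNet data are installed; the word list is illustrative):

from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

for word in ["connections", "connected", "troubling", "studies"]:
    # Stemming can yield non-dictionary strings (e.g. "troubl");
    # lemmatization returns dictionary forms but depends on the POS tag (default: noun).
    print(word, "stem:", stemmer.stem(word), "lemma:", lemmatizer.lemmatize(word))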

5. Porter stemmer algorithm

The Porter stemming algorithm (or ‘Porter stemmer’) is a process for removing the commoner
morphological and inflexional endings from words in English. Its main use is as part of a term
normalisation process that is usually done when setting up Information Retrieval systems.
• Popular for information retrieval and text categorization
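A minimal sketch of the rule-based, suffix-stripping idea behind the algorithm, showing only Step 1a-style rules (the real algorithm has several ordered steps with conditions on the stem; this is not a full implementation):

import re

# A few Porter Step 1a rewrite rules, tried in order; the first match wins.
RULES = [
    (r"sses$", "ss"),  # caresses -> caress
    (r"ies$", "i"),    # ponies   -> poni
    (r"ss$", "ss"),    # caress   -> caress (unchanged)
    (r"s$", ""),       # cats     -> cat
]

def step1a(word):
    for pattern, replacement in RULES:
        if re.search(pattern, word):
            return re.sub(pattern, replacement, word)
    return word

for w in ["caresses", "ponies", "caress", "cats"]:
    print(w, "->", step1a(w))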
4. Applications of NLP

• Chatbots and Virtual Assistants: Enhance user interactions by providing automated responses and
support based on natural language input. They streamline customer service and improve accessibility.
• Sentiment Analysis: Analyze social media, reviews, or customer feedback to gauge the overall sentiment
towards products or services. This helps businesses tailor their strategies and improve customer satisfaction.
• Machine Translation: Automatically translate text between different languages, facilitating cross-
language communication. It bridges gaps in international collaboration and content consumption.
• Speech Recognition: Convert spoken language into text, enabling hands-free control and transcription
services. It enhances accessibility for users with disabilities and improves efficiency in transcription tasks.
• Text Classification: Categorize text into predefined categories, useful for spam detection or content
moderation. This helps in organizing large volumes of information and maintaining content quality.
• Named Entity Recognition (NER): Identify and classify entities such as people, organizations, and
locations in text. It aids in organizing information and extracting meaningful insights from documents.
• Information Extraction: Automatically extract relevant data from large volumes of text, useful for
summarizing reports or research papers. It accelerates data analysis and decision-making processes.
• Summarization: Condense long articles or documents into brief summaries while retaining key
information. This makes it easier to grasp essential content quickly and efficiently.
• Question Answering: Provide accurate answers to user queries based on a given text or knowledge base.
It enhances user experience by delivering precise and contextually relevant information.
• Language Generation: Create human-like text for applications such as content creation, dialogue
systems, and creative writing. It facilitates automated content production and interactive storytelling.

5. N gram model
The n-gram model in NLP is a technique for predicting the next word in a sequence based on the previous
n−1 words. It’s widely used for various NLP tasks, including text generation and speech
recognition. Here’s a breakdown:
1. Definition: An n-gram is a contiguous sequence of n items from a given sample of text or speech. These
items can be words, characters, or other tokens.
2. Unigram (1-gram): Considers single words independently. For example, given a text corpus, it calculates
the frequency of each individual word.

3. Bigram (2-gram): Considers pairs of consecutive words. For instance, it estimates the probability of a
word based on the word that precedes it. If the sentence is "The cat sat," the bigrams would be "The cat" and "cat sat."
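A minimal sketch of extracting and counting n-grams (the sample sentence is illustrative):

from collections import Counter

def ngrams(tokens, n):
    # All contiguous n-grams of a token list, as tuples
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "the cat sat on the mat".split()
unigram_counts = Counter(ngrams(tokens, 1))
bigram_counts = Counter(ngrams(tokens, 2))
print(bigram_counts)   # e.g. ('the', 'cat'): 1, ('cat', 'sat'): 1, ...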

6. MLE
Maximum Likelihood Estimation (MLE) in NLP is a statistical method used to estimate the parameters of
a probabilistic model based on observed data. It aims to find the parameters that maximize the likelihood
of the observed data under the model. Here’s how MLE is applied in NLP:
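For n-gram language models, the MLE of a bigram probability is simply the relative frequency P(wi | wi-1) = C(wi-1 wi) / C(wi-1). A minimal sketch (the toy corpus and function name are assumptions for illustration):

from collections import Counter

corpus = "the cat sat on the mat the cat ran".split()
bigram_counts = Counter(zip(corpus, corpus[1:]))
unigram_counts = Counter(corpus)

def mle_bigram(prev, word):
    # Count of the bigram divided by count of the preceding word
    return bigram_counts[(prev, word)] / unigram_counts[prev]

print(mle_bigram("the", "cat"))   # C(the cat) = 2, C(the) = 3 -> about 0.67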
7. Examples on N gram
8. Role of FSA in morphological analysis

• Definition: Finite State Automata are computational models used to represent and manipulate regular
languages. They consist of states, transitions between states, and acceptance conditions for recognizing
sequences of symbols (strings).

• Morphological Analysis: In morphological analysis, FSAs are employed to model and analyze the structure
of words, including their prefixes, roots, suffixes, and inflections. They help in understanding and processing
the internal structure of words.

• Lexical Representation: FSAs can represent morphological rules and patterns. For instance, they can encode
the rules for verb conjugation or noun declension in a language. By defining states and transitions based on
morphological rules, FSAs can parse and generate word forms.

• Tokenization: FSAs are used to tokenize text by identifying and separating meaningful units, such as words
or morphemes, based on predefined patterns. This is crucial for breaking down complex words into their
constituent morphemes.

• Stemming and Lemmatization: FSAs can be used in stemming (reducing words to their root forms) and
lemmatization (finding the base or dictionary form of a word). They facilitate the transformation of word forms
into their canonical forms by processing morphological variations.

• Morphological Generation: FSAs enable the generation of word forms based on morphological rules. For
example, they can produce all possible inflections of a word given a set of grammatical rules.

• Efficiency: FSAs are efficient in processing and analyzing morphological structures due to their simple and
deterministic nature. They provide a compact and efficient way to handle regular morphological patterns.
• Integration with Other Models: FSAs can be combined with other models, such as probabilistic models or
neural networks, to enhance morphological analysis. They can provide foundational rules and patterns that more
complex models can build upon.
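A minimal sketch of a deterministic FSA over morphemes that accepts a toy pattern of English noun inflection, a root plus an optional plural suffix (the states, transitions, and word lists are illustrative assumptions):

# States: "start" -> "root" (after a known noun root) -> "plural" (after a plural suffix).
ROOTS = {"cat", "dog", "fox"}
TRANSITIONS = {
    ("start", "root"): "root",
    ("root", "s"): "plural",
    ("root", "es"): "plural",
}
ACCEPTING = {"root", "plural"}

def accepts(morphemes):
    # Run the automaton over a list of morphemes, e.g. ["fox", "es"]
    state = "start"
    for m in morphemes:
        symbol = "root" if m in ROOTS else m
        state = TRANSITIONS.get((state, symbol))
        if state is None:
            return False
    return state in ACCEPTING

print(accepts(["cat"]))        # True
print(accepts(["fox", "es"]))  # True
print(accepts(["es", "cat"]))  # False (suffix before a root)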

9. Design FSA
10. Affixes and types of affixes

• Prefix: Added to the beginning of a base word to alter its meaning. For example, in "unhappy," "un-" is a
prefix that negates the meaning of "happy."

• Suffix: Added to the end of a base word to modify its meaning or function. For example, in "happiness",
"-ness" is a suffix that turns the adjective "happy" into a noun.

• Infix: Inserted within a base word, though this is less common in English. For example, in some languages
like Tagalog, infixes can be used to alter the meaning of a root word. In English, infixation is rare but can be
seen in informal language, like "un-freaking-believable."

• Circumfix: Surrounds the base word with both a prefix and a suffix. This type is also more common in
languages other than English. For example, in German, the circumfix "ge-...-t" can be used to form past
participles (e.g., "geliebt" from "lieben").

11. Laplace smoothing


Laplace smoothing, also known as additive smoothing, is a technique used in NLP to handle the problem
of zero probabilities in probabilistic models, especially in language modeling. Here’s how it works and
why it’s important:
Purpose
• Zero Probability Problem: In probabilistic models like n-gram models, some word sequences might not
appear in the training data. This results in zero probabilities for these unseen sequences, which can affect
the performance of the model.
• Improving Probability Estimates: Laplace smoothing adjusts the probabilities to ensure that no event
(such as a word or sequence of words) has zero probability, even if it has not been observed in the training
data.
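With add-one (Laplace) smoothing, the bigram estimate becomes P(wi | wi-1) = (C(wi-1 wi) + 1) / (C(wi-1) + V), where V is the vocabulary size. A minimal sketch (the toy corpus is an illustrative assumption):

from collections import Counter

corpus = "the cat sat on the mat the cat ran".split()
bigram_counts = Counter(zip(corpus, corpus[1:]))
unigram_counts = Counter(corpus)
V = len(unigram_counts)   # vocabulary size

def laplace_bigram(prev, word):
    # Add-one smoothing: every bigram, seen or unseen, gets a non-zero probability
    return (bigram_counts[(prev, word)] + 1) / (unigram_counts[prev] + V)

print(laplace_bigram("the", "cat"))   # seen bigram
print(laplace_bigram("the", "sofa"))  # unseen bigram, but still > 0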

12. Inflectional and derivational morphology

13. Role of Regular expression

Tokenization: Regex helps in splitting text into tokens, such as words or sentences. For example, a regex
pattern can identify punctuation marks and whitespace to separate text into individual words or sentences.
Text Normalization: Regex is used to normalize text by removing unwanted characters, standardizing
formats, or correcting inconsistencies. For instance, regex can strip out non-alphanumeric characters or
standardize dates.
Pattern Matching: It allows for identifying and extracting specific patterns from text. This includes
detecting email addresses, phone numbers, URLs, or specific keywords within a corpus.
Text Extraction: Regex can extract relevant information from text based on patterns. For example, it can
pull out product codes, dates, or any structured data embedded in unstructured text.
Data Cleaning: In NLP, regex is used to clean and preprocess text data by removing noise, correcting
formatting issues, or handling misspellings and irregularities.
Information Retrieval: Regex helps in refining search queries and matching relevant documents. For
instance, it can be used to filter search results or extract specific sections from documents based on pattern
criteria.
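A minimal sketch of a few of these uses with Python's re module (the patterns and the sample text are illustrative, not production-grade):

import re

text = "Contact us at support@example.com or call 555-0100. Visit https://example.com today!"

# Tokenization: pull out runs of word characters
tokens = re.findall(r"\w+", text)

# Pattern matching / extraction: e-mail addresses and URLs
emails = re.findall(r"[\w.+-]+@[\w-]+\.[\w.]+", text)
urls = re.findall(r"https?://[\w./-]+", text)

# Normalization / cleaning: lowercase and strip non-alphanumeric characters
cleaned = re.sub(r"[^a-z0-9\s]", "", text.lower())

print(tokens[:4], emails, urls)
print(cleaned)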
