Practice Set - NLP

Intro to NLP and Challenges (1–4)

1.​ What is Natural Language Processing (NLP)? Provide two real-world examples of NLP
applications and explain their significance.
2.​ Discuss the challenge of lexical ambiguity in NLP. Provide an example sentence with a
polysemous word and explain how it affects processing.
3.​ Why is context understanding a major challenge in NLP? Illustrate with an example of a
sentence where context changes meaning.
4.​ Explain how evolving language (e.g., slang, emojis) poses a challenge for NLP systems.
Suggest one approach to address this issue.

Preprocessing 1: Tokenization and Stopwords (5–8)

5.​ Given the text: "Wow!! NLP is super fun...", tokenize it into lowercase words, remove
punctuation, and list the tokens. [3 Marks]
6.​ What are two types of tokenization techniques? Provide an example of each applied to
the sentence "I’m learning NLP."
7.​ Given the text: "The dog runs in the park.", remove stopwords {the, in, a} and list the
remaining tokens. [3 Marks]
8.​ Explain the impact of stopword removal on text classification. Provide an example where
stopword removal improves model performance.
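
A minimal sketch of the preprocessing in Questions 5 and 7, using only the standard library (the regex tokenizer is an assumption; NLTK's word_tokenize, relevant to Question 6, would split contractions like "I'm" differently):

```python
import re

def tokenize(text):
    # Lowercase, then keep only runs of alphabetic characters as tokens.
    return re.findall(r"[a-z]+", text.lower())

STOPWORDS = {"the", "in", "a"}  # the stopword list given in Question 7

print(tokenize("Wow!! NLP is super fun..."))
# ['wow', 'nlp', 'is', 'super', 'fun']

tokens = tokenize("The dog runs in the park.")
print([t for t in tokens if t not in STOPWORDS])
# ['dog', 'runs', 'park']
```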

Preprocessing 2: Stemming and Lemmatization (9–12)

9.​ What is the difference between stemming and lemmatization? Apply Porter Stemmer
and a lemmatizer to the word "running" and compare results.
10.​Given the words {studies, studying, studied}, apply Porter Stemmer to each and list the
stems. [3 Marks]
11.​Lemmatize the words {geese, better, running} using a standard lemmatizer. Assume verb
context for "running". List the lemmas. [3 Marks]
12.​Explain why lemmatization is preferred over stemming in tasks like information retrieval.
Provide an example.
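
The stems and lemmas in Questions 9–11 can be checked with NLTK (assuming the nltk package and its wordnet data are installed; the commented outputs are what the Porter stemmer and WordNet lemmatizer produce):

```python
from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()  # requires the NLTK 'wordnet' corpus

for w in ["studies", "studying", "studied", "running"]:
    print(w, "->", stemmer.stem(w))
# studies -> studi, studying -> studi, studied -> studi, running -> run

print(lemmatizer.lemmatize("geese"))            # goose (noun is the default)
print(lemmatizer.lemmatize("better", pos="a"))  # good (adjective context)
print(lemmatizer.lemmatize("running", pos="v")) # run (verb context, per Question 11)
```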

Morphology (13–15)

13. What is morphological parsing? Provide an example of parsing the word "unhappiness"
into its morphemes.
14.​Explain how Finite State Transducers (FSTs) are used in morphological analysis. Give
an example of an FST rule for pluralization.
15.​Given the word "cats", use an FST to generate its singular form and explain the
transformation steps. [4 Marks]
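
Real morphological FSTs are built with toolkits such as OpenFST or HFST; the fragment below is only a toy sketch of the Question 15 idea, with a one-word lexicon and made-up feature labels, showing how the surface suffix 's' pairs with a plural feature (and, read in the generation direction, how dropping it yields the singular):

```python
def analyze(surface):
    # Toy two-state transducer: recognize the stem, then transduce the suffix.
    if not surface.startswith("cat"):
        return None  # stem not in this toy lexicon
    rest = surface[len("cat"):]
    if rest == "s":
        return "cat+N+PL"  # surface 's' maps to the plural feature
    if rest == "":
        return "cat+N+SG"  # empty suffix maps to the singular feature
    return None

print(analyze("cats"))  # cat+N+PL -> the singular surface form is the stem "cat"
print(analyze("cat"))   # cat+N+SG
```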

Syntax and CKY Algorithm (16–19)

16.​What is syntactic parsing? Draw a constituency parse tree for the sentence "The cat
sleeps."
17.​Given the CFG: S → NP VP, NP → Det N, VP → V, Det → the, N → dog, V → barks,
use the CKY algorithm to parse "the dog barks". Show the parsing table. [6 Marks]
18.​Explain why the CKY algorithm requires grammars in Chomsky Normal Form (CNF).
Provide an example of converting a rule to CNF.
19.​Given the sentence "a cat runs" and CFG: S → NP VP, NP → Det N, VP → V, Det → a,
N → cat, V → runs, fill the CKY table to verify if it’s grammatical. [5 Marks]
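
A compact recognizer for Question 19's grammar (Question 17's is analogous). One caveat: VP → V is a unit production, which strict CNF, as Question 18 notes, would eliminate; as a sketch, the code instead closes the diagonal cells under unit productions:

```python
# Grammar from Question 19; lexical rules live in LEXICON.
LEXICON = {"a": {"Det"}, "cat": {"N"}, "runs": {"V"}}
BINARY = [("NP", ("Det", "N")), ("S", ("NP", "VP"))]
UNARY = [("VP", "V")]  # CNF conversion would fold this into VP -> runs

def cky(words):
    n = len(words)
    # table[i][j] holds the nonterminals that span words[i:j].
    table = [[set() for _ in range(n + 1)] for _ in range(n)]
    for i, w in enumerate(words):
        cell = set(LEXICON.get(w, set()))
        for a, b in UNARY:  # unit-production closure on the diagonal
            if b in cell:
                cell.add(a)
        table[i][i + 1] = cell
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):
                for a, (b, c) in BINARY:
                    if b in table[i][k] and c in table[k][j]:
                        table[i][j].add(a)
    return table

table = cky("a cat runs".split())
print(table[0][3])  # {'S'} -> "a cat runs" is grammatical under this CFG
```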

Language Modeling and N-grams (20–23)

20.​What is a language model? Provide an example of how a trigram model predicts the next
word in a sentence.
21.​Given the corpus: "I eat rice. I eat bread.", list all bigrams and their counts. [4 Marks]
22.​Using the corpus: "the cat runs the dog jumps", calculate the MLE probability of the
bigram P(runs|cat). [4 Marks]
23.​Explain the difference between unigram, bigram, and trigram models. Why do
higher-order n-grams capture more context?
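
The counting in Questions 21–22 can be checked in a few lines (periods are dropped and no sentence-boundary markers are used, which is an assumption; adding <s>/</s> tokens would change the counts):

```python
from collections import Counter

def bigram_counts(tokens):
    return Counter(zip(tokens, tokens[1:]))

# Question 21:
print(bigram_counts("i eat rice i eat bread".split()))
# (i, eat): 2, (eat, rice): 1, (rice, i): 1, (eat, bread): 1

# Question 22: MLE P(runs|cat) = count(cat runs) / count(cat)
tokens = "the cat runs the dog jumps".split()
big, uni = bigram_counts(tokens), Counter(tokens)
print(big[("cat", "runs")] / uni["cat"])  # 1 / 1 = 1.0
```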

Smoothing (24–26)

24. Why is smoothing necessary in n-gram models? Provide an example of a
zero-probability issue without smoothing.
25.​Given a unigram model with counts {the: 3, cat: 2, runs: 1}, vocabulary size 4, calculate
Laplace-smoothed probabilities for "cat" and an unseen word "dog". [5 Marks]
26.​Given a bigram model with counts: P(dog|the) = 2/5, P(cat|the) = 3/5, total "the" count =
5, vocabulary size = 3, compute Laplace-smoothed P(cat|the). [4 Marks]
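
A worked check for Question 25: with add-one (Laplace) smoothing, P(w) = (count(w) + 1) / (N + V), where N is the total token count and V the vocabulary size:

```python
counts = {"the": 3, "cat": 2, "runs": 1}
N = sum(counts.values())  # 6 total tokens
V = 4                     # vocabulary size given in Question 25

def p_laplace(word):
    return (counts.get(word, 0) + 1) / (N + V)

print(p_laplace("cat"))  # (2 + 1) / (6 + 4) = 0.3
print(p_laplace("dog"))  # (0 + 1) / (6 + 4) = 0.1, the unseen word

# Question 26 (bigram case): P(cat|the) = (count(the cat) + 1) / (count(the) + V)
print((3 + 1) / (5 + 3))  # 0.5
```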

POS Tagging and HMMs (27–30)

27.​What is Part-of-Speech (POS) tagging? Explain how statistical POS taggers use context
to assign tags.
28.​Given an HMM with tags {NOUN, VERB}, transition counts: NOUN→VERB = 2,
NOUN→NOUN = 1, and emission counts: NOUN(cat) = 2, VERB(runs) = 1, calculate
P(VERB|NOUN) and P(cat|NOUN). [5 Marks]
29.​Explain the components of a Hidden Markov Model (HMM) for POS tagging. Provide an
example of states and observations.
30.​Given a tag sequence: [NOUN, VERB, NOUN] and emissions: P(dog|NOUN) = 0.5,
P(runs|VERB) = 0.6, P(cat|NOUN) = 0.4, calculate the probability of the sequence "dog
runs cat". [5 Marks]

Viterbi Algorithm (31–32)

31.​Explain how the Viterbi algorithm decodes the most likely tag sequence in an HMM.
Provide a simple example with two tags.
32.​Given an HMM with tags {NOUN, VERB}, transitions: P(NOUN|NOUN) = 0.6,
P(VERB|NOUN) = 0.4, emissions: P(cat|NOUN) = 0.7, P(runs|VERB) = 0.8, use Viterbi
to find the most likely tag sequence for "cat runs". Show the trellis. [6 Marks]
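
A small Viterbi sketch for Question 32. The question omits the start distribution and the VERB-row transitions, so the code assumes the sequence starts in NOUN with probability 1 and treats missing probabilities as 0; the trellis it prints is what the question asks to show:

```python
# Probabilities from Question 32; missing entries default to 0.
start = {"NOUN": 1.0, "VERB": 0.0}  # assumption: start in NOUN
trans = {("NOUN", "NOUN"): 0.6, ("NOUN", "VERB"): 0.4}
emit = {("NOUN", "cat"): 0.7, ("VERB", "runs"): 0.8}
tags, words = ["NOUN", "VERB"], ["cat", "runs"]

# Initialization: first trellis column.
delta = [{t: start[t] * emit.get((t, words[0]), 0.0) for t in tags}]
back = [{}]
# Recursion: extend the best path into each tag for each later word.
for w in words[1:]:
    prev, col, ptr = delta[-1], {}, {}
    for t in tags:
        best = max(tags, key=lambda s: prev[s] * trans.get((s, t), 0.0))
        col[t] = prev[best] * trans.get((best, t), 0.0) * emit.get((t, w), 0.0)
        ptr[t] = best
    delta.append(col)
    back.append(ptr)

print(delta)  # [{'NOUN': 0.7, 'VERB': 0.0}, {'NOUN': 0.0, 'VERB': 0.224}]
# Backtrace: best final tag VERB, preceded by NOUN -> sequence NOUN VERB.
```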

One-Hot Encoding and BoW (33–35)

33.​Given a vocabulary {apple, banana, orange}, create one-hot encoding vectors for "apple"
and "banana". [4 Marks]
34.​Given the sentence "I eat an apple" and vocabulary {I, eat, an, apple, banana}, create a
Bag-of-Words vector. [4 Marks]
35.​Explain the limitations of one-hot encoding in NLP. Why is it less effective than dense
embeddings for semantic tasks?
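
Questions 33–34 are mechanical; a short sketch:

```python
vocab = ["apple", "banana", "orange"]

def one_hot(word):
    # 1 in the word's vocabulary position, 0 elsewhere.
    return [1 if w == word else 0 for w in vocab]

print(one_hot("apple"))   # [1, 0, 0]
print(one_hot("banana"))  # [0, 1, 0]

# Question 34: Bag-of-Words counts over the given vocabulary.
bow_vocab = ["I", "eat", "an", "apple", "banana"]
sentence = "I eat an apple".split()
print([sentence.count(w) for w in bow_vocab])  # [1, 1, 1, 1, 0]
```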

TF-IDF (36–37)

36.​Given two documents: Doc1: "I love NLP", Doc2: "NLP is fun", calculate the TF-IDF
score for "NLP" in Doc1. Show TF, IDF, and final score. [6 Marks]
37.​Explain the TF-IDF formula. Why does it prioritize rare terms in document
representation? Provide an example.
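
A worked sketch for Question 36 with TF = count/document length and IDF = log(N/df). Because "NLP" occurs in both documents, the raw IDF is log(2/2) = 0; the last line shows the smoothed variant log(N/df) + 1, which is an assumption some textbooks use:

```python
import math

doc1 = "I love NLP".lower().split()
doc2 = "NLP is fun".lower().split()
docs = [doc1, doc2]

tf = doc1.count("nlp") / len(doc1)       # 1/3
df = sum(1 for d in docs if "nlp" in d)  # appears in 2 of 2 documents
idf = math.log(len(docs) / df)           # log(2/2) = 0
print(tf, idf, tf * idf)                 # 0.333... 0.0 0.0

# Smoothed variant (assumption): idf' = log(N/df) + 1 -> tf-idf = 1/3
print(tf * (idf + 1))                    # 0.333...
```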

Word2Vec and Semantics (38–40)

38.​Explain the core idea behind Word2Vec’s CBOW model. How does it learn word
embeddings from a corpus?
39.​What is Word Sense Disambiguation (WSD)? Apply a simplified Lesk algorithm to
disambiguate "pen" in "She wrote with a pen" using glosses: pen₁ (writing tool), pen₂
(animal enclosure). [5 Marks]
40.​Explain how semantic similarity is measured in Word2Vec. Provide an example of two
words with high similarity.
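
A simplified Lesk sketch for Question 39: score each sense by word overlap between its gloss and the sentence. The tiny lemma table is an assumption added so that "wrote" can match "writing"; without some normalization the raw overlap is zero for both senses:

```python
# Tiny lemma table (assumption) so that "wrote" matches "writing".
LEMMA = {"wrote": "write", "writing": "write"}
STOP = {"a", "an", "the", "with", "she", "for"}

def normalize(text):
    return {LEMMA.get(w, w) for w in text.lower().split()} - STOP

def simplified_lesk(context, glosses):
    ctx = normalize(context)
    # Choose the sense whose gloss has the largest word overlap with the context.
    overlaps = {sense: len(ctx & normalize(gloss)) for sense, gloss in glosses.items()}
    return max(overlaps, key=overlaps.get), overlaps

glosses = {"pen_1": "writing tool", "pen_2": "animal enclosure"}
print(simplified_lesk("She wrote with a pen", glosses))
# ('pen_1', {'pen_1': 1, 'pen_2': 0})
```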

Sentiment, Classification, and Summarization (41–44)

41.​Given a dataset: Positive: "Great app", Negative: "Bad app", calculate Naive Bayes
likelihoods for "great" and "bad". Classify "Great app" assuming equal priors. [6 Marks]
42.​What is a sentiment lexicon? Provide an example of using a lexicon to classify "This
movie is awesome" as positive.
43.​Explain how TextRank performs extractive summarization. Given sentences S1–S3 with
similarities S1–S2 = 0.5, S1–S3 = 0.3, S2–S3 = 0.4, rank them after one iteration
(damping = 0.85). [6 Marks]
44.​What is text classification? Provide an example of classifying emails as spam or not
spam using Naive Bayes.
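
The arithmetic of Question 41 in a few lines. Following the question literally, the sketch uses unsmoothed MLE likelihoods (with only one training document per class, Laplace smoothing would change the numbers):

```python
from collections import Counter

train = {"pos": "great app".split(), "neg": "bad app".split()}
counts = {c: Counter(words) for c, words in train.items()}
totals = {c: sum(cnt.values()) for c, cnt in counts.items()}

def score(text, c, prior=0.5):
    p = prior
    for w in text.lower().split():
        p *= counts[c][w] / totals[c]  # MLE likelihood, no smoothing
    return p

print(score("Great app", "pos"))  # 0.5 * (1/2) * (1/2) = 0.125
print(score("Great app", "neg"))  # 0.5 * 0 * (1/2) = 0.0 -> classify positive
```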

Information Extraction and Question Answering (45–47)

45. What is relation extraction? Provide an example of extracting a "works-at" relation from
"Jane works at IBM."
46.​Given a query: "AI tools" and documents: Doc1: {AI, tech}, Doc2: {AI, tools}, calculate
Jaccard similarity for each. [5 Marks]
47.​Given a QnA system with documents: Doc1: {cat, runs}, Doc2: {dog, jumps}, and query:
{cat}, compute cosine similarity for each document using term frequency vectors.
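
For Question 47, with term-frequency vectors over the vocabulary {cat, runs, dog, jumps}, only Doc1 shares a term with the query; a sketch:

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

vocab = ["cat", "runs", "dog", "jumps"]
def vec(words):
    return [1 if w in words else 0 for w in vocab]

query = vec({"cat"})
print(cosine(query, vec({"cat", "runs"})))   # 1 / sqrt(2) ~= 0.707 -> Doc1
print(cosine(query, vec({"dog", "jumps"})))  # 0.0 -> Doc2
```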
Machine Translation (48–52)

48. Given the English sentence "The blue car is big", translate it to Hindi using rule-based
MT. Explain morphological and syntactic transfer rules. [6 Marks]
49.​What are the challenges of multilingual NLP? Provide an example of word order
differences affecting translation.
50.​Explain how dictionary lookup fails in machine translation. Provide an example sentence
where it produces incorrect output.
51.​Explain how the attention mechanism in NMT improves translation quality over Statistical
MT. Provide an example of translating "The cat sleeps" to French, highlighting attention’s
role.
52.​Given an NMT model with a vocabulary of 3 words {the, cat, sleeps} and a test sentence
"the cat sleeps", compute the probability of the target French translation "le chat dort"
using hypothetical softmax outputs: P(le|start) = 0.8, P(chat|le) = 0.7, P(dort|chat) = 0.6.
[5 Marks]
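
Question 52 is a direct application of the chain rule over the decoder's softmax outputs; a one-line check:

```python
# P(le chat dort) = P(le|start) * P(chat|le) * P(dort|chat)
print(0.8 * 0.7 * 0.6)  # 0.336
```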
Retrieval, Sentiment, and Summarization (53–55)

53. A QnA system uses Jaccard Similarity to retrieve the most relevant document for a
user’s query. The knowledge base and query are preprocessed (stopwords removed):

Knowledge Base Documents:

○ Doc A: {‘artificial’, ‘intelligence’, ‘systems’}
○ Doc B: {‘machine’, ‘intelligence’, ‘models’}
○ Doc C: {‘systems’, ‘data’, ‘processing’}

User Query: Q = {‘intelligence’, ‘systems’}

Tasks:
a) State the formula for Jaccard Similarity. [2 Marks]
b) Calculate the Jaccard Similarity between Query Q and each document (Doc A, Doc B,
Doc C). Show step-by-step calculations for intersection and union sizes. [6 Marks]
c) Which document is retrieved as the most relevant? Explain why based on the
similarity scores. [2 Marks]
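
A quick cross-check for part (b), using Python's set operations (Jaccard(A, B) = |A ∩ B| / |A ∪ B|):

```python
def jaccard(a, b):
    return len(a & b) / len(a | b)

q = {"intelligence", "systems"}
docs = {
    "Doc A": {"artificial", "intelligence", "systems"},
    "Doc B": {"machine", "intelligence", "models"},
    "Doc C": {"systems", "data", "processing"},
}
for name, d in docs.items():
    print(name, jaccard(q, d))
# Doc A: 2/3 ~= 0.667, Doc B: 1/4 = 0.25, Doc C: 1/4 = 0.25 -> retrieve Doc A
```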

54. You are using the VADER sentiment analysis tool to analyze the following tweet:
“Love the new app’s design, but it crashes constantly!”

Tasks (Token-Level Analysis):

a) Identify the sentiment-bearing words in the tweet. [2 Marks]

b) For each, indicate whether it contributes to positive, negative, or neutral sentiment in
VADER. [2 Marks]

c) Explain how the word “constantly” impacts the sentiment score of “crashes” in
VADER. [2 Marks]

d) Describe how the conjunction “but” affects the overall sentiment score per VADER’s
rules. [2 Marks]

e) If VADER’s compound score is -0.15, classify the sentiment as positive, negative, or
neutral using VADER thresholds. [1 Mark]
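
For part (e), VADER's conventional thresholds are: compound >= 0.05 is positive, <= -0.05 is negative, and anything in between is neutral, so -0.15 classifies as negative. A quick check of the full tweet with the vaderSentiment package (assuming it is installed; exact scores vary by lexicon version):

```python
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

analyzer = SentimentIntensityAnalyzer()
scores = analyzer.polarity_scores("Love the new app's design, but it crashes constantly!")
print(scores)  # dict with 'neg', 'neu', 'pos' and a 'compound' score

compound = scores["compound"]
label = "positive" if compound >= 0.05 else "negative" if compound <= -0.05 else "neutral"
print(label)
```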
55. You are tasked with summarizing a news article using the TextRank algorithm. The
article has four sentences:

S1: “AI improves efficiency in manufacturing.”

S2: “Tests were conducted across multiple factories.”

S3: “AI achieved 90% accuracy in quality control.”

S4: “Factories report cost savings with AI.”

Assume cosine similarity scores: S1–S2 = 0.3, S1–S3 = 0.7, S1–S4 = 0.5, S2–S3
= 0.4, S2–S4 = 0.2, S3–S4 = 0.6.​
Tasks:​
a) Construct the sentence similarity graph (nodes = sentences, edges =
similarities). [2 Marks]​
b) Perform two iterations of the TextRank algorithm with a damping factor of
0.85. Show initial scores and calculations. [5 Marks]​
c) Rank the sentences and select the top two for the summary. Explain their
relevance. [3 Marks]
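
A sketch of parts (b) and (c). It uses the common TextRank update TR(i) = (1 - d) + d * sum_j [ w(j, i) / sum_k w(j, k) ] * TR(j) with d = 0.85; uniform initial scores of 1.0 are an assumption (1/N is also used in practice):

```python
sims = {
    ("S1", "S2"): 0.3, ("S1", "S3"): 0.7, ("S1", "S4"): 0.5,
    ("S2", "S3"): 0.4, ("S2", "S4"): 0.2, ("S3", "S4"): 0.6,
}
nodes = ["S1", "S2", "S3", "S4"]

def w(a, b):
    # Undirected similarity graph: look the edge up in either order.
    return sims.get((a, b), sims.get((b, a), 0.0))

scores = {n: 1.0 for n in nodes}  # initial scores (assumption)
d = 0.85
for _ in range(2):  # two iterations, as the task asks
    scores = {
        i: (1 - d) + d * sum(
            w(j, i) / sum(w(j, k) for k in nodes if k != j) * scores[j]
            for j in nodes if j != i
        )
        for i in nodes
    }

for n, s in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(n, round(s, 3))
# With these assumptions the ranking is S3 > S1 > S4 > S2,
# so S3 and S1 are selected for the two-sentence summary.
```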
