Assignment - 03

Q1. Open vs Closed Dictionary

In Natural Language Processing (NLP) and computational linguistics, the concepts of open and closed
dictionaries refer to how vocabulary is managed and extended in language models.

Open Dictionary

An open dictionary, also known as an open vocabulary, allows for the inclusion of new words or
terms that were not previously part of the vocabulary. This is particularly important in dynamic
environments where new words frequently appear, such as social media, news, or scientific texts.

Characteristics:

• Flexibility: Open dictionaries can adapt to new terms and slang, which makes them suitable
for real-time language processing.
• Generalization: They are more general and can handle a wider range of text inputs without
needing frequent manual updates.
• Challenges: They often require mechanisms to handle unknown words, such as word
embeddings, subword tokenization (e.g., Byte-Pair Encoding, WordPiece), or other out-of-
vocabulary (OOV) handling strategies (a minimal sketch follows this list).
• Use Cases: Open dictionaries are used in search engines, social media platforms, and any
application that needs to process and understand evolving language patterns.
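As a rough illustration of the OOV-handling point above, the Python sketch below greedily splits an unseen word into known subword pieces. The toy vocabulary and the split_into_subwords helper are illustrative only; real systems learn their subword units (e.g., with Byte-Pair Encoding or WordPiece) from a large corpus.

# Minimal sketch: greedy longest-match subword splitting for an open vocabulary.
# SUBWORD_VOCAB is a hypothetical, hand-picked set of subword units.
SUBWORD_VOCAB = {"un", "believ", "able", "ly", "token", "ize"}

def split_into_subwords(word, vocab):
    """Greedily split a word into the longest known subword pieces,
    falling back to single characters so no word is unrepresentable."""
    pieces = []
    i = 0
    while i < len(word):
        for j in range(len(word), i, -1):
            piece = word[i:j]
            if piece in vocab or j == i + 1:
                pieces.append(piece)
                i = j
                break
    return pieces

print(split_into_subwords("unbelievably", SUBWORD_VOCAB))
# -> ['un', 'believ', 'a', 'b', 'ly'] with this toy vocabulary

Even though "unbelievably" never appeared in the vocabulary, it is still representable as a sequence of known pieces, which is what makes the vocabulary effectively open.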

Closed Dictionary

A closed dictionary, or closed vocabulary, is a fixed set of words that does not allow the introduction
of new terms beyond those initially defined. It is static and predefined.

Characteristics:

• Predictability: A closed dictionary is limited to a specific set of words, which can make certain
tasks like parsing or tagging more predictable and efficient.
• Simplicity: These dictionaries are easier to manage and often lead to more straightforward
implementations.
• Limitations: They are less effective at handling new words, misspellings, or variations of
words not included in the dictionary.
• Use Cases: Closed dictionaries are suitable for applications where the language is controlled
or specialized, such as in legal documents, technical manuals, or controlled vocabularies in
specific domains like medicine.
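By contrast, a closed-vocabulary system typically maps anything outside its fixed word list to a single unknown token. A minimal sketch, assuming a hypothetical five-word vocabulary and an <UNK> placeholder:

# Minimal sketch: a closed vocabulary replaces unseen words with <UNK>.
CLOSED_VOCAB = {"the", "patient", "was", "given", "medication"}  # fixed, predefined

def tokenize_closed(sentence, vocab, unk="<UNK>"):
    """Lowercase, split on whitespace, and replace out-of-vocabulary words."""
    return [w if w in vocab else unk for w in sentence.lower().split()]

print(tokenize_closed("The patient was given paracetamol", CLOSED_VOCAB))
# -> ['the', 'patient', 'was', 'given', '<UNK>']

The behaviour is predictable and simple, but any word outside the predefined list loses its identity.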
Q2. Edit Distance

Edit distance, also known as Levenshtein distance, is a metric used to measure the difference
between two strings. It is defined as the minimum number of single-character edits required to
transform one string into another. These edits can be insertions, deletions, or substitutions.

Calculation

• Insertion: Adding a character to the string.
• Deletion: Removing a character from the string.
• Substitution: Replacing one character with another.

The edit distance between two strings can be computed using dynamic programming, which builds a
matrix to store the minimum edit distances between all prefixes of the two strings.

Example

For example, to transform the word "kitten" into "sitting", you would need to:

• Substitute 'k' with 's' (1 substitution)
• Substitute 'e' with 'i' (1 substitution)
• Insert 'g' at the end (1 insertion)

Thus, the edit distance is 3.
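The dynamic-programming computation described above can be sketched in Python as follows; running it on "kitten" and "sitting" reproduces the edit distance of 3.

def edit_distance(a, b):
    """Levenshtein distance via dynamic programming over prefixes."""
    m, n = len(a), len(b)
    # dp[i][j] = minimum edits to turn a[:i] into b[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i              # delete all i characters of a
    for j in range(n + 1):
        dp[0][j] = j              # insert all j characters of b
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dp[i][j] = min(
                dp[i - 1][j] + 1,         # deletion
                dp[i][j - 1] + 1,         # insertion
                dp[i - 1][j - 1] + cost,  # substitution (or match)
            )
    return dp[m][n]

print(edit_distance("kitten", "sitting"))  # -> 3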

Applications

• Spell Checking: Suggesting corrections for misspelled words by finding words with a small
edit distance.
• DNA Sequencing: Comparing genetic sequences to find similarities and differences.
• Plagiarism Detection: Comparing documents to find copied text with slight modifications.
• Information Retrieval: Finding similar query terms in search engines.

Q3. Smoothing in NLP

Smoothing is a technique used in statistical language modeling to handle the problem of zero
probabilities in unseen events, especially in n-gram models where some word sequences might not
appear in the training data.

Laplace Smoothing (Add-One Smoothing)

Laplace smoothing, also known as add-one smoothing, is a simple technique where 1 is added to the
count of each word or n-gram in the vocabulary. This ensures that no probability is ever zero.

Formula:

\[

P_{\text{Laplace}}(w_i \mid w_{i-1}) = \frac{C(w_{i-1}, w_i) + 1}{C(w_{i-1}) + V}

\]
where \( C(w_{i-1}, w_i) \) is the count of the bigram (or n-gram), \( C(w_{i-1}) \) is the count of the
previous word (or n-gram prefix), and \( V \) is the size of the vocabulary.

• Advantage: It is simple and effective for small datasets or when the vocabulary size is not too
large.
• Disadvantage: It can lead to over-smoothing, where the probabilities of frequent events are
underestimated because of the uniform increment of 1 to all counts.
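A minimal sketch of the formula above on a toy corpus (the corpus and counts are illustrative only):

from collections import Counter

corpus = "the cat sat on the mat the cat ate".split()  # toy data

unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))
V = len(unigrams)  # vocabulary size

def p_laplace(prev, word):
    """P(word | prev) with add-one (Laplace) smoothing."""
    return (bigrams[(prev, word)] + 1) / (unigrams[prev] + V)

print(p_laplace("the", "cat"))  # seen bigram
print(p_laplace("cat", "on"))   # unseen bigram, still gets a non-zero probability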

Good-Turing Discounting

Good-Turing discounting adjusts the estimated probabilities of n-grams based on the frequency of
frequencies, i.e., how often different frequency counts occur in the data.

Key Idea: It reallocates some probability mass from seen to unseen events, on the reasoning that
n-grams observed many times are reliably estimated, whereas n-grams seen only once or twice may
owe much of their count to sampling chance, so their raw counts overestimate their true probability.

Formula:

\[

P_{\text{GT}}(w_i \mid w_{i-1}) = \frac{(C(w_{i-1}, w_i) + 1) \cdot N_{C(w_{i-1}, w_i) + 1}}{N_{C(w_{i-1}, w_i)} \cdot N}

\]

where \( N_r \) is the number of n-grams that occur exactly \( r \) times, and \( N \) is the total
number of n-grams.

• Advantage: It is more effective than Laplace smoothing in preserving the distribution of
probabilities, especially when there are many low-frequency n-grams.
• Disadvantage: It is computationally more complex and requires good estimation of the \( N_r \)
values.
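A rough sketch of the Good-Turing estimate on toy bigram counts, following the formula above. No smoothing of the \( N_r \) values is applied here, which a real implementation (e.g., Simple Good-Turing) would need whenever some \( N_{c+1} \) is zero:

from collections import Counter

# Toy bigram counts; values are illustrative only.
bigram_counts = Counter({
    ("the", "cat"): 3, ("the", "dog"): 1, ("a", "cat"): 1,
    ("a", "dog"): 2, ("the", "mat"): 1,
})

N = sum(bigram_counts.values())                 # total number of bigram tokens
freq_of_freq = Counter(bigram_counts.values())  # N_r: number of bigrams seen exactly r times

def p_good_turing(bigram):
    """Good-Turing estimate ((c + 1) * N_{c+1}) / (N_c * N) for a seen bigram."""
    c = bigram_counts[bigram]
    n_c, n_c1 = freq_of_freq[c], freq_of_freq[c + 1]
    if n_c1 == 0:
        return c / N  # naive fallback; real systems smooth the N_r curve instead
    return ((c + 1) * n_c1) / (n_c * N)

print(p_good_turing(("the", "dog")))  # once-seen bigram gets a discounted estimate
print(freq_of_freq[1] / N)            # probability mass reserved for unseen bigrams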

Applications

Smoothing techniques are crucial in language modeling for tasks like:
• Speech Recognition: Accurately predicting the likelihood of sequences of words.
• Machine Translation: Ensuring all possible translations are considered, even if unseen.
• Text Prediction: Improving the reliability of autocomplete systems by better handling rare
word combinations.
