Assignment - 03

Q1. Open vs Closed Dictionary

In Natural Language Processing (NLP) and computational linguistics, the concepts of open and closed
dictionaries refer to how vocabulary is managed and extended in language models.

Open Dictionary

An open dictionary, also known as an open vocabulary, allows for the inclusion of new words or
terms that were not previously part of the vocabulary. This is particularly important in dynamic
environments where new words frequently appear, such as social media, news, or scientific texts.

Characteristics:

• Flexibility: Open dictionaries can adapt to new terms and slang, which makes them suitable
for real-time language processing.
• Generalization: They are more general and can handle a wider range of text inputs without
needing frequent manual updates.
• Challenges: They often require mechanisms to handle unknown words, such as word
embeddings, subword tokenization (e.g., Byte-Pair Encoding, WordPiece), or other out-of-
vocabulary (OOV) handling strategies (a minimal sketch follows this list).
• Use Cases: Open dictionaries are used in search engines, social media platforms, and any
application that needs to process and understand evolving language patterns.
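As a rough illustration of the OOV-handling point above, the Python sketch below greedily splits an unseen word into known subword pieces. The toy vocabulary and the split_into_subwords helper are illustrative only; real systems learn their subword units (e.g., with Byte-Pair Encoding or WordPiece) from a large corpus.

# Minimal sketch: greedy longest-match subword splitting for an open vocabulary.
# SUBWORD_VOCAB is a hypothetical, hand-picked set of subword units.
SUBWORD_VOCAB = {"un", "believ", "able", "ly", "token", "ize"}

def split_into_subwords(word, vocab):
    """Greedily split a word into the longest known subword pieces,
    falling back to single characters so no word is unrepresentable."""
    pieces = []
    i = 0
    while i < len(word):
        for j in range(len(word), i, -1):
            piece = word[i:j]
            if piece in vocab or j == i + 1:
                pieces.append(piece)
                i = j
                break
    return pieces

print(split_into_subwords("unbelievably", SUBWORD_VOCAB))
# -> ['un', 'believ', 'a', 'b', 'ly'] with this toy vocabulary

Even though "unbelievably" never appeared in the vocabulary, it is still representable as a sequence of known pieces, which is what makes the vocabulary effectively open.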

Closed Dictionary

A closed dictionary, or closed vocabulary, is a fixed set of words that does not allow the introduction
of new terms beyond those initially defined. It is static and predefined.

Characteristics:

• Predictability: A closed dictionary is limited to a specific set of words, which can make certain
tasks like parsing or tagging more predictable and efficient.
• Simplicity: These dictionaries are easier to manage and often lead to more straightforward
implementations.
• Limitations: They are less effective at handling new words, misspellings, or variations of
words not included in the dictionary.
• Use Cases: Closed dictionaries are suitable for applications where the language is controlled
or specialized, such as in legal documents, technical manuals, or controlled vocabularies in
specific domains like medicine.
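By contrast, a closed-vocabulary system typically maps anything outside its fixed word list to a single unknown token. A minimal sketch, assuming a hypothetical five-word vocabulary and an <UNK> placeholder:

# Minimal sketch: a closed vocabulary replaces unseen words with <UNK>.
CLOSED_VOCAB = {"the", "patient", "was", "given", "medication"}  # fixed, predefined

def tokenize_closed(sentence, vocab, unk="<UNK>"):
    """Lowercase, split on whitespace, and replace out-of-vocabulary words."""
    return [w if w in vocab else unk for w in sentence.lower().split()]

print(tokenize_closed("The patient was given paracetamol", CLOSED_VOCAB))
# -> ['the', 'patient', 'was', 'given', '<UNK>']

The behaviour is predictable and simple, but any word outside the predefined list loses its identity.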
Q2. Edit Distance

Edit distance, also known as Levenshtein distance, is a metric used to measure the difference
between two strings. It is defined as the minimum number of single-character edits required to
transform one string into another. These edits can be insertions, deletions, or substitutions.

Calculation

• Insertion: Adding a character to the string.
• Deletion: Removing a character from the string.
• Substitution: Replacing one character with another.

The edit distance between two strings can be computed using dynamic programming, which builds a
matrix to store the minimum edit distances between all prefixes of the two strings.

Example

For example, to transform the word "kitten" into "sitting", you would need to:

• Substitute 'k' with 's' (1 substitution)
• Substitute 'e' with 'i' (1 substitution)
• Insert 'g' at the end (1 insertion)

Thus, the edit distance is 3.
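The dynamic-programming computation described above can be sketched in Python as follows; running it on "kitten" and "sitting" reproduces the edit distance of 3.

def edit_distance(a, b):
    """Levenshtein distance via dynamic programming over prefixes."""
    m, n = len(a), len(b)
    # dp[i][j] = minimum edits to turn a[:i] into b[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i              # delete all i characters of a
    for j in range(n + 1):
        dp[0][j] = j              # insert all j characters of b
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dp[i][j] = min(
                dp[i - 1][j] + 1,         # deletion
                dp[i][j - 1] + 1,         # insertion
                dp[i - 1][j - 1] + cost,  # substitution (or match)
            )
    return dp[m][n]

print(edit_distance("kitten", "sitting"))  # -> 3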

Applications

• Spell Checking: Suggesting corrections for misspelled words by finding words with a small
edit distance.
• DNA Sequencing: Comparing genetic sequences to find similarities and differences.
• Plagiarism Detection: Comparing documents to find copied text with slight modifications.
• Information Retrieval: Finding similar query terms in search engines.

Q3. Smoothing in NLP

Smoothing is a technique used in statistical language modeling to handle the problem of zero
probabilities in unseen events, especially in n-gram models where some word sequences might not
appear in the training data.

Laplace Smoothing (Add-One Smoothing)

Laplace smoothing, also known as add-one smoothing, is a simple technique where 1 is added to the
count of each word or n-gram in the vocabulary. This ensures that no probability is ever zero.

Formula:

\[

P_{\text{Laplace}}(w_i \mid w_{i-1}) = \frac{C(w_{i-1}, w_i) + 1}{C(w_{i-1}) + V}

\]
where \( C(w_{i-1}, w_i) \) is the count of the bigram (or n-gram), \( C(w_{i-1}) \) is the count of the
previous word (or n-gram prefix), and \( V \) is the size of the vocabulary.

• Advantage: It is simple and effective for small datasets or when the vocabulary size is not too
large.
• Disadvantage: It can lead to over-smoothing, where the probabilities of frequent events are
underestimated because of the uniform increment of 1 to all counts.
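A minimal sketch of the formula above on a toy corpus (the corpus and counts are illustrative only):

from collections import Counter

corpus = "the cat sat on the mat the cat ate".split()  # toy data

unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))
V = len(unigrams)  # vocabulary size

def p_laplace(prev, word):
    """P(word | prev) with add-one (Laplace) smoothing."""
    return (bigrams[(prev, word)] + 1) / (unigrams[prev] + V)

print(p_laplace("the", "cat"))  # seen bigram
print(p_laplace("cat", "on"))   # unseen bigram, still gets a non-zero probability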

Good-Turing Discounting

Good-Turing discounting adjusts the estimated probabilities of n-grams based on the frequency of
frequencies, i.e., how often different frequency counts occur in the data.

Key Idea: It reallocates some probability mass from seen to unseen events, on the reasoning that
n-grams observed many times are reliably estimated, whereas n-grams seen only once or twice may
owe much of their count to sampling chance, so their raw counts overestimate their true probability.

Formula:

\[

P_{\text{GT}}(w_i \mid w_{i-1}) = \frac{(C(w_{i-1}, w_i) + 1) \cdot N_{C(w_{i-1}, w_i) + 1}}{N_{C(w_{i-1}, w_i)} \cdot N}

\]

where \( N_r \) is the number of n-grams that occur exactly \( r \) times, and \( N \) is the total
number of n-grams.

• Advantage: It is more effective than Laplace smoothing in preserving the distribution of
probabilities, especially when there are many low-frequency n-grams.
• Disadvantage: It is computationally more complex and requires good estimation of the \( N_r \)
values.
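A rough sketch of the Good-Turing estimate on toy bigram counts, following the formula above. No smoothing of the \( N_r \) values is applied here, which a real implementation (e.g., Simple Good-Turing) would need whenever some \( N_{c+1} \) is zero:

from collections import Counter

# Toy bigram counts; values are illustrative only.
bigram_counts = Counter({
    ("the", "cat"): 3, ("the", "dog"): 1, ("a", "cat"): 1,
    ("a", "dog"): 2, ("the", "mat"): 1,
})

N = sum(bigram_counts.values())                 # total number of bigram tokens
freq_of_freq = Counter(bigram_counts.values())  # N_r: number of bigrams seen exactly r times

def p_good_turing(bigram):
    """Good-Turing estimate ((c + 1) * N_{c+1}) / (N_c * N) for a seen bigram."""
    c = bigram_counts[bigram]
    n_c, n_c1 = freq_of_freq[c], freq_of_freq[c + 1]
    if n_c1 == 0:
        return c / N  # naive fallback; real systems smooth the N_r curve instead
    return ((c + 1) * n_c1) / (n_c * N)

print(p_good_turing(("the", "dog")))  # once-seen bigram gets a discounted estimate
print(freq_of_freq[1] / N)            # probability mass reserved for unseen bigrams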

Applications

Smoothing techniques are crucial in language modeling for tasks like:
• Speech Recognition: Accurately predicting the likelihood of sequences of words.
• Machine Translation: Ensuring all possible translations are considered, even if unseen.
• Text Prediction: Improving the reliability of autocomplete systems by better handling rare
word combinations.
