UNIT 5: LANGUAGE MODELING
1. Introduction to Language Modeling
A Language Model (LM) in NLP is a probabilistic model that assigns a probability to a
sequence of words. It predicts the next word in a sentence from the context provided by the
previous words.
Applications:
- Predictive text input
- Speech recognition
- Spelling correction
- Machine translation
- Chatbots
Example: "I love reading history..." -> next word: "books"
2. N-Gram Models
N-gram = sequence of N words.
- Unigram: "I", "love", "reading"
- Bigram: "I love", "love reading"
- Trigram: "I love reading"
Formula using Chain Rule:
P(W) = P(w1) * P(w2|w1) * P(w3|w1, w2) * ... * P(wn|w1, ..., wn-1)
Approximation (Markov assumption): P(wn | w1, ..., wn-1) ≈ P(wn | wn-N+1, ..., wn-1)
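As a minimal sketch of these formulas, the following Python builds bigram MLE estimates from a tiny toy corpus (the corpus, the <s>/</s> sentence markers, and the function names are illustrative, not part of the notes):

from collections import defaultdict

# Tiny illustrative corpus; <s> and </s> mark sentence boundaries.
corpus = [
    ["<s>", "i", "love", "reading", "history", "books", "</s>"],
    ["<s>", "i", "love", "reading", "novels", "</s>"],
]

bigram_counts = defaultdict(int)
unigram_counts = defaultdict(int)
for sentence in corpus:
    for prev, curr in zip(sentence, sentence[1:]):
        bigram_counts[(prev, curr)] += 1
        unigram_counts[prev] += 1

def bigram_prob(prev, curr):
    """MLE estimate: P(curr | prev) = count(prev, curr) / count(prev)."""
    if unigram_counts[prev] == 0:
        return 0.0
    return bigram_counts[(prev, curr)] / unigram_counts[prev]

print(bigram_prob("love", "reading"))    # 1.0 (both "love" tokens are followed by "reading")
print(bigram_prob("reading", "history")) # 0.5 ("reading" occurs twice, once before "history")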
3. Language Model Evaluation
i. Coverage Rate: percentage of n-grams in the test data that were seen in training.
ii. Perplexity: measures how well the model predicts the test data; lower is better.
PP(W) = (1/P(w1 ... wN))^(1/N) for a test set of N words, equivalently 2^H where H is the cross-entropy per word.
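A short sketch of computing perplexity in log space, assuming a per-bigram probability function such as the one in the previous sketch (all names are illustrative):

import math

def perplexity(sentence, prob):
    """PP(W) = P(w1 ... wN) ** (-1/N), computed in log space to avoid underflow."""
    log_prob = 0.0
    n = 0
    for prev, curr in zip(sentence, sentence[1:]):
        p = prob(prev, curr)
        if p == 0.0:
            return float("inf")  # one unseen bigram makes the whole sequence impossible
        log_prob += math.log2(p)
        n += 1
    return 2 ** (-log_prob / n)  # 2 raised to the average negative log2-probability per word

# e.g. perplexity(["<s>", "i", "love", "reading", "history", "books", "</s>"], bigram_prob)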
4. Parameter Estimation
i. MLE: P(wi | wi-2, wi-1) = count(wi-2, wi-1, wi) / count(wi-2, wi-1)
ii. Smoothing: Assigns small probabilities to unseen n-grams.
Backoff: Uses lower-order n-grams when data is sparse.
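A hedged sketch of add-one (Laplace) smoothing and a crude unigram backoff, reusing the illustrative bigram_counts and unigram_counts from the earlier sketch (this is a simplification, not full Katz backoff):

def smoothed_bigram_prob(prev, curr, vocab_size):
    """Add-one (Laplace) smoothing: every bigram gets a pseudo-count of 1,
    so unseen bigrams receive a small but non-zero probability."""
    return (bigram_counts[(prev, curr)] + 1) / (unigram_counts[prev] + vocab_size)

def backoff_bigram_prob(prev, curr, total_tokens):
    """Crude backoff: if the bigram was never seen, fall back to the
    unigram relative frequency of curr."""
    if bigram_counts[(prev, curr)] > 0:
        return bigram_counts[(prev, curr)] / unigram_counts[prev]
    return unigram_counts[curr] / total_tokens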
5. Language Model Adaptation
Used when applying models to new domains.
Techniques:
- Interpolation: Mix in-domain and general models (see the sketch after this list)
- Topic-based adaptation: Cluster documents into topics
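A minimal sketch of linear interpolation between an in-domain and a general model; the two probability functions and the weight lam are placeholders to be tuned on held-out in-domain data:

def interpolated_prob(prev, curr, in_domain_prob, general_prob, lam=0.7):
    """Linear interpolation: mix the in-domain estimate with the general-domain
    estimate; the weight lam is chosen on held-out data."""
    return lam * in_domain_prob(prev, curr) + (1 - lam) * general_prob(prev, curr)

# e.g. interpolated_prob("love", "reading", medical_bigram_prob, news_bigram_prob)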
6. Types of Language Models
i. Class-Based: Group words (e.g., cities, animals)
ii. Variable-Length: Use n-gram contexts of varying length rather than a fixed N
iii. Discriminative: Trained directly to separate correct word sequences from incorrect ones
iv. Topic-Based (LDA): Discover hidden topics in docs
v. Neural Network Models: Use deep learning (Word2Vec, BERT)
7. Language-Specific Modeling Problems
i. Morphologically Rich: Model morphemes (sub-word units) instead of full words to keep the vocabulary manageable
ii. No Word Segmentation: Languages such as Chinese and Japanese lack explicit word boundaries, so segmentation is needed first
iii. Spoken vs Written: Spoken language differs from written text (disfluencies, no punctuation) and requires transcription
8. Multilingual and Crosslingual Modeling
i. Multilingual: Handle multiple languages & code-switching
Example: "I need to tell her que no voy a poder ir."
ii. Crosslingual: Use one language's data for another
(Translate data, or share representations across languages, e.g., with LSA)
Conclusion:
Language modeling is essential in NLP for understanding and generating human language. It
ranges from simple n-gram models to advanced neural models.