NLP Study Session - January 2025
Natural Language Processing (NLP)
Teaching Machines to Understand Human Language
Core Challenge: Human language is ambiguous, contextual, and constantly evolving.
NLP bridges the gap between human communication and computer understanding.
1. What is NLP?
Natural Language Processing sits at the intersection of linguistics, computer science, and AI.
It's about making computers understand, interpret, and generate human language in a valuable way.
Remember: Language is not just words - it's context, culture, emotion, and meaning!
NLP VS COMPUTATIONAL LINGUISTICS:
NLP: Engineering-focused, build systems that work
Comp Linguistics: Science-focused, understand language computationally
2. The NLP Pipeline
Raw Text → Tokenization → Normalization → POS Tagging → Named Entity Recognition → Dependency Parsing → Feature Extraction → Model Application → Output
STEP-BY-STEP BREAKDOWN:
1. Text Preprocessing
Tokenization: Split text into words/subwords/characters
Lowercasing: "Hello" → "hello"
Removing punctuation: "Hello!" → "Hello"
Removing stop words: "the", "is", "at", etc.
Stemming: "running" → "run"
Lemmatization: "better" → "good" (context-aware)
Example Pipeline:
Original: "The cats are running quickly!"
Tokenized: ["The", "cats", "are", "running", "quickly", "!"]
Lowercased: ["the", "cats", "are", "running", "quickly", "!"]
Stop words removed: ["cats", "running", "quickly"]
Lemmatized: ["cat", "run", "quickly"]
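A minimal sketch of this preprocessing pipeline using spaCy (one possible choice - NLTK or plain Python would work too); it assumes the small English model en_core_web_sm has been installed:

import spacy

# Assumes: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")
doc = nlp("The cats are running quickly!")

# Keep alphabetic, non-stop-word tokens and take their lowercased lemmas
lemmas = [tok.lemma_.lower() for tok in doc if tok.is_alpha and not tok.is_stop]
print(lemmas)  # roughly: ['cat', 'run', 'quickly']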
3. Traditional NLP Techniques
A. BAG OF WORDS (BOW)
Document = {word1: count1, word2: count2, ...}
"I love NLP" → {I: 1, love: 1, NLP: 1}
Simple but loses word order and context!
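A quick bag-of-words sketch with scikit-learn's CountVectorizer (assuming scikit-learn is available):

from sklearn.feature_extraction.text import CountVectorizer

docs = ["I love NLP", "I love machine learning", "NLP is fun"]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)          # sparse document-term count matrix

print(vectorizer.get_feature_names_out())   # learned vocabulary
print(X.toarray())                          # one row of word counts per document
# Note: the default token pattern drops single-character tokens like "I".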
B. TF-IDF (TERM FREQUENCY - INVERSE DOCUMENT FREQUENCY)
TF-IDF(t,d) = TF(t,d) × IDF(t)
TF(t,d) = (# of times term t appears in doc d) / (total # of terms in d)
IDF(t) = log(N / df(t)), where N = total documents and df(t) = documents containing term t
Highlights important words while downweighting common ones!
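The formula can be computed by hand on a toy corpus (a sketch of the definition above, not a library implementation):

import math

docs = [
    ["the", "cat", "sat"],
    ["the", "dog", "sat"],
    ["the", "cat", "ran"],
]
N = len(docs)

def tf(term, doc):
    return doc.count(term) / len(doc)

def idf(term):
    df = sum(1 for doc in docs if term in doc)
    return math.log(N / df)

def tf_idf(term, doc):
    return tf(term, doc) * idf(term)

print(tf_idf("the", docs[0]))  # 0.0   -> "the" appears in every doc, so IDF = 0
print(tf_idf("dog", docs[1]))  # ~0.37 -> rarer term, higher weight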
C. N-GRAMS
Unigram: ["I", "love", "NLP"]
Bigram: ["I love", "love NLP"]
Trigram: ["I love NLP"]
Captures local context but suffers from sparsity!
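Generating n-grams is just a sliding window over the token list:

def ngrams(tokens, n):
    # Slide a window of length n across the token list
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = ["I", "love", "NLP"]
print(ngrams(tokens, 1))  # ['I', 'love', 'NLP']
print(ngrams(tokens, 2))  # ['I love', 'love NLP']
print(ngrams(tokens, 3))  # ['I love NLP']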
4. Part-of-Speech (POS) Tagging
Word    | POS Tag | Description
The     | DT      | Determiner
cat     | NN      | Noun
sits    | VBZ     | Verb (3rd person singular)
quietly | RB      | Adverb
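The same tags can be reproduced with spaCy (again assuming en_core_web_sm is installed):

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("The cat sits quietly")

for tok in doc:
    # tok.tag_ is the fine-grained Penn Treebank tag (DT, NN, VBZ, RB, ...)
    print(tok.text, tok.tag_, spacy.explain(tok.tag_))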
5. Named Entity Recognition (NER)
Input: "Apple Inc. was founded by Steve Jobs in Cupertino."
Output:
- Apple Inc. → ORGANIZATION
- Steve Jobs → PERSON
- Cupertino → LOCATION
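A sketch of the same example with spaCy's pretrained NER (its label names differ slightly, e.g. ORG and GPE instead of ORGANIZATION and LOCATION):

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple Inc. was founded by Steve Jobs in Cupertino.")

for ent in doc.ents:
    # Expected output along the lines of:
    #   Apple Inc. → ORG, Steve Jobs → PERSON, Cupertino → GPE
    print(ent.text, "→", ent.label_)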
6. Word Embeddings - The Game Changer
Key Insight: Represent each word as a dense vector (low-dimensional compared to sparse one-hot encodings) so that similar words end up close together in the embedding space!
WORD2VEC
CBOW (Continuous Bag of Words): Predict word from context
Skip-gram: Predict context from word
king - man + woman ≈ queen (word arithmetic works in embedding space!)
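Training a toy Word2Vec model with Gensim - a sketch only, since the tiny corpus below is far too small for the word arithmetic to actually work; it just shows the API:

from gensim.models import Word2Vec

sentences = [
    ["the", "king", "rules", "the", "kingdom"],
    ["the", "queen", "rules", "the", "kingdom"],
    ["the", "man", "walks", "in", "the", "city"],
    ["the", "woman", "walks", "in", "the", "city"],
]

# sg=1 selects skip-gram; sg=0 (the default) is CBOW
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1)

# king - man + woman ≈ ?  (only meaningful with a large training corpus)
print(model.wv.most_similar(positive=["king", "woman"], negative=["man"], topn=3))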
GLOVE (GLOBAL VECTORS)
Combines global matrix factorization with local context windows
FASTTEXT
Extends Word2Vec by considering character n-grams - handles OOV words!
7. Modern NLP - The Transformer Era
2017: "Attention Is All You Need" paper changed everything!
THE ATTENTION MECHANISM
Attention(Q, K, V) = softmax(QK^T / √d_k) × V
where Q = queries, K = keys, V = values, and d_k = dimension of the keys
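A NumPy sketch of scaled dot-product attention, following the formula above directly:

import numpy as np

def attention(Q, K, V):
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                        # query-key similarity, scaled
    scores = scores - scores.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)  # softmax over the keys
    return weights @ V                                     # weighted sum of the values

Q = np.random.randn(2, 4)   # 2 queries, d_k = 4
K = np.random.randn(3, 4)   # 3 keys
V = np.random.randn(3, 4)   # 3 values
print(attention(Q, K, V).shape)  # (2, 4)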
BERT (BIDIRECTIONAL ENCODER REPRESENTATIONS FROM TRANSFORMERS)
Pre-trained on massive text corpus
Bidirectional context understanding
Fine-tune for specific tasks
Masked Language Modeling (MLM) objective
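The MLM objective can be tried out directly with a HuggingFace fill-mask pipeline (using the bert-base-uncased checkpoint as an example):

from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")

# BERT predicts the masked token from both left and right context
for pred in fill_mask("The capital of France is [MASK]."):
    print(pred["token_str"], round(pred["score"], 3))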
GPT SERIES (GENERATIVE PRE-TRAINED TRANSFORMER)
Autoregressive language modeling
Unidirectional (left-to-right)
Excellent for generation tasks
GPT-3: 175B parameters!
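A small-scale taste of autoregressive generation using the openly available GPT-2 checkpoint (a sketch; GPT-3 itself is only accessible via API):

from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")

out = generator("Natural language processing is", max_new_tokens=20)
print(out[0]["generated_text"])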
8. Common NLP Tasks
Classification Tasks:
Sentiment Analysis - Positive/Negative/Neutral
Spam Detection - Spam/Not Spam
Topic Classification - News categories
Intent Detection - User's intention
Sequence Labeling Tasks:
NER - Entity recognition
POS Tagging - Grammatical roles
Chunking - Phrase identification
Generation Tasks:
Machine Translation - Language A → Language B
Text Summarization - Long → Short
Question Answering - Context + Question → Answer
Dialogue Systems - Chatbots
9. Evaluation Metrics
FOR CLASSIFICATION:
Accuracy: Overall correctness
Precision/Recall/F1: For imbalanced datasets
ROC-AUC: For binary classification
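Computing the classification metrics with scikit-learn on hypothetical labels:

from sklearn.metrics import accuracy_score, precision_recall_fscore_support

y_true = [1, 0, 1, 1, 0, 1]   # hypothetical gold labels
y_pred = [1, 0, 0, 1, 0, 1]   # hypothetical model predictions

precision, recall, f1, _ = precision_recall_fscore_support(y_true, y_pred, average="binary")
print("accuracy :", accuracy_score(y_true, y_pred))
print("precision:", precision, "recall:", recall, "F1:", f1)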
FOR GENERATION:
BLEU: Machine translation quality
ROUGE: Summarization quality
Perplexity: Language model quality
Human Evaluation: Still the gold standard!
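And a quick sentence-level BLEU check with NLTK (smoothing is used so short toy examples don't score zero):

from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = [["the", "cat", "is", "on", "the", "mat"]]   # list of reference token lists
candidate = ["the", "cat", "sat", "on", "the", "mat"]

score = sentence_bleu(reference, candidate, smoothing_function=SmoothingFunction().method1)
print(round(score, 3))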
10. Python Libraries & Tools
Essential NLP Toolkit:
NLTK - Classic, educational, lots of resources
spaCy - Fast, production-ready, great for NER/POS
Gensim - Topic modeling, word embeddings
Transformers (HuggingFace) - State-of-the-art models
TextBlob - Simple API for common tasks
Stanford CoreNLP - Java-based, very comprehensive
11. Challenges in NLP
The Hard Problems:
Ambiguity: "I saw her duck" (verb or noun?)
Sarcasm/Irony: "Great, another meeting!" (not actually great)
Context: "Bank" (financial or river?)
Multilingual: Different languages, different structures
Domain Adaptation: Medical vs Legal vs Casual text
Bias: Models learn societal biases from data
12. Advanced Topics
ZERO-SHOT & FEW-SHOT LEARNING
Perform tasks without task-specific training data!
CROSS-LINGUAL MODELS
Models that work across multiple languages (mBERT, XLM-R)
PROMPT ENGINEERING
The art of crafting inputs to get desired outputs from LLMs
RETRIEVAL-AUGMENTED GENERATION (RAG)
Combine retrieval with generation for factual, up-to-date responses
13. Real-World Applications
Industry Applications
Healthcare: Clinical notes analysis, drug discovery
Finance: Sentiment analysis for trading, document processing
Legal: Contract analysis, legal research
Customer Service: Chatbots, ticket routing
Education: Automated grading, personalized tutoring
14. Code Example - Sentiment Analysis
from transformers import pipeline
# Load pre-trained model
classifier = pipeline("sentiment-analysis")
# Analyze sentiment
texts = [
"I love this product!",
"This is terrible.",
"It's okay, nothing special."
]
results = classifier(texts)
for text, result in zip(texts, results):
    print(f"{text} → {result['label']}: {result['score']:.3f}")
15. Future Directions
What's Next in NLP?
Multimodal models (text + image + audio)
More efficient models (smaller, faster)
Better reasoning capabilities
Improved factuality and reduced hallucinations
Personal AI assistants
Real-time translation breaking language barriers
16. Study Tips & Resources
My Learning Path:
1. Master regex and basic text processing
2. Understand traditional methods (BoW, TF-IDF)
3. Learn word embeddings thoroughly
4. Dive into transformers and attention
5. Practice with real datasets (Kaggle, papers)
6. Build end-to-end projects
RECOMMENDED PAPERS:
"Attention Is All You Need" (2017)
"BERT: Pre-training of Deep Bidirectional Transformers" (2018)
"Language Models are Few-Shot Learners" (GPT-3, 2020)
"Language is the foundation of human intelligence - teaching machines to understand it is teaching them to
think"