
NLP Study Session - January 2025

Natural Language Processing (NLP)

Teaching Machines to Understand Human Language

Core Challenge: Human language is ambiguous, contextual, and constantly evolving.


NLP bridges the gap between human communication and computer understanding.

1. What is NLP?

Natural Language Processing sits at the intersection of linguistics, computer science, and AI.
It's about making computers understand, interpret, and generate human language in a valuable
way.

Remember: Language is not just words - it's context, culture, emotion, and
meaning!

NLP VS COMPUTATIONAL LINGUISTICS:

NLP: Engineering-focused, build systems that work

Comp Linguistics: Science-focused, understand language computationally

2. The NLP Pipeline

Raw Text → Tokenization → Normalization → POS Tagging → Named Entity Recognition → Dependency Parsing → Feature Extraction → Model Application → Output

STEP-BY-STEP BREAKDOWN:

1. Text Preprocessing

Tokenization: Split text into words/subwords/characters

Lowercasing: "Hello" → "hello"


Removing punctuation: "Hello!" → "Hello"

Removing stop words: "the", "is", "at", etc.

Stemming: "running" → "run"

Lemmatization: "better" → "good" (context-aware)

Example Pipeline:
Original: "The cats are running quickly!"
Tokenized: ["The", "cats", "are", "running", "quickly", "!"]
Lowercased: ["the", "cats", "are", "running", "quickly", "!"]
Stop words removed: ["cats", "running", "quickly"]
Lemmatized: ["cat", "run", "quickly"]
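A quick way to reproduce this pipeline in code is with spaCy (see section 10). A minimal sketch, assuming the small English model en_core_web_sm has been downloaded (python -m spacy download en_core_web_sm):

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("The cats are running quickly!")

# Tokenize, then drop stop words and punctuation and keep lowercased lemmas
tokens = [tok.text for tok in doc]
lemmas = [tok.lemma_.lower() for tok in doc if not tok.is_stop and not tok.is_punct]

print(tokens)  # ['The', 'cats', 'are', 'running', 'quickly', '!']
print(lemmas)  # roughly ['cat', 'run', 'quickly'], matching the example above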

3. Traditional NLP Techniques

A. BAG OF WORDS (BOW)

Document = {word1: count1, word2: count2, ...}

"I love NLP" → {I: 1, love: 1, NLP: 1}

Simple but loses word order and context!
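In practice, scikit-learn's CountVectorizer builds the BoW representation in a couple of lines. A minimal sketch (note that its default tokenizer drops one-character tokens like "I"):

from sklearn.feature_extraction.text import CountVectorizer

docs = ["I love NLP", "I love machine learning"]
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)  # sparse document-term count matrix

print(vectorizer.get_feature_names_out())  # learned vocabulary
print(X.toarray())                         # per-document word counts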

B. TF-IDF (TERM FREQUENCY - INVERSE DOCUMENT FREQUENCY)

TF-IDF(t,d) = TF(t,d) × IDF(t)

TF(t,d) = (# of times term t appears in doc d) / (total # of terms in doc d)

IDF(t) = log(N / df(t)), where N = total documents and df(t) = documents
containing term t

Highlights important words while downweighting common ones!
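A direct implementation of the formula above, just to make the arithmetic concrete (the documents here are made up; scikit-learn's TfidfVectorizer uses a slightly different, smoothed IDF):

import math
from collections import Counter

docs = ["the cat sat on the mat", "the dog sat on the log", "cats and dogs are pets"]
N = len(docs)
tokenized = [d.split() for d in docs]
df = Counter(term for doc in tokenized for term in set(doc))  # document frequency

def tf_idf(term, doc_tokens):
    tf = doc_tokens.count(term) / len(doc_tokens)  # TF(t,d)
    idf = math.log(N / df[term])                   # IDF(t)
    return tf * idf

print(tf_idf("cat", tokenized[0]))  # rare term → higher weight
print(tf_idf("the", tokenized[0]))  # common term → lower weight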

C. N-GRAMS

Unigram: ["I", "love", "NLP"]

Bigram: ["I love", "love NLP"]

Trigram: ["I love NLP"]

Captures local context but suffers from sparsity!
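A tiny helper makes the idea concrete; the function below is just an illustration, not a library call:

def ngrams(tokens, n):
    # Slide a window of size n over the token list
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = ["I", "love", "NLP"]
print(ngrams(tokens, 1))  # ['I', 'love', 'NLP']
print(ngrams(tokens, 2))  # ['I love', 'love NLP']
print(ngrams(tokens, 3))  # ['I love NLP']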


4. Part-of-Speech (POS) Tagging

Word POS Tag Description

The DT Determiner

cat NN Noun

sits VBZ Verb (3rd person singular)

quietly RB Adverb
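With spaCy the same sentence can be tagged directly (again assuming en_core_web_sm is installed; the exact tags can vary by model version):

import spacy

nlp = spacy.load("en_core_web_sm")
for tok in nlp("The cat sits quietly"):
    print(tok.text, tok.tag_, tok.pos_)  # expect roughly The/DT, cat/NN, sits/VBZ, quietly/RB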

5. Named Entity Recognition (NER)

Input: "Apple Inc. was founded by Steve Jobs in Cupertino."


Output:
- Apple Inc. → ORGANIZATION
- Steve Jobs → PERSON
- Cupertino → LOCATION
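spaCy's pre-trained pipeline gives comparable output out of the box, though its label names differ slightly (e.g. ORG and GPE instead of ORGANIZATION and LOCATION):

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple Inc. was founded by Steve Jobs in Cupertino.")
for ent in doc.ents:
    print(ent.text, ent.label_)  # e.g. Apple Inc. ORG, Steve Jobs PERSON, Cupertino GPE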

6. Word Embeddings - The Game Changer

Key Insight: Represent words as dense vectors in high-dimensional space where similar
words are close together!

WORD2VEC

CBOW (Continuous Bag of Words): Predict word from context

Skip-gram: Predict context from word

king - man + woman ≈ queen (word arithmetic works in embedding space!)
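Training a toy Word2Vec model with Gensim takes only a few lines. This sketch uses a made-up three-sentence corpus, so the vectors are meaningless; the famous king/queen analogy needs a large corpus or pre-trained vectors:

from gensim.models import Word2Vec

sentences = [
    ["i", "love", "nlp"],
    ["i", "love", "machine", "learning"],
    ["nlp", "is", "fun"],
]
# sg=1 selects Skip-gram; sg=0 would use CBOW
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1)

print(model.wv["nlp"].shape)                 # a dense 50-dimensional vector
print(model.wv.most_similar("nlp", topn=2))  # nearest neighbours in embedding space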

GLOVE (GLOBAL VECTORS)

Combines global matrix factorization with local context windows

FASTTEXT

Extends Word2Vec by considering character n-grams - handles OOV words!


7. Modern NLP - The Transformer Era

2017: "Attention Is All You Need" paper changed everything!

THE ATTENTION MECHANISM

Attention(Q, K, V) = softmax(QK^T / √d_k) × V

Q = Queries, K = Keys, V = Values, d_k = dimension of the keys
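The formula translates almost line-for-line into NumPy. A minimal single-head sketch (real implementations add masking, multiple heads, and batching):

import numpy as np

def attention(Q, K, V):
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                            # QK^T / √d_k
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)    # softmax over the keys
    return weights @ V                                         # weighted sum of values

Q = np.random.randn(4, 8)  # 4 query positions, d_k = 8
K = np.random.randn(6, 8)  # 6 key/value positions
V = np.random.randn(6, 8)
print(attention(Q, K, V).shape)  # (4, 8)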

BERT (BIDIRECTIONAL ENCODER REPRESENTATIONS FROM TRANSFORMERS)

Pre-trained on massive text corpus

Bidirectional context understanding

Fine-tune for specific tasks

Masked Language Modeling (MLM) objective
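The MLM objective is easy to poke at with HuggingFace's fill-mask pipeline (this downloads bert-base-uncased on first run):

from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")
for pred in fill_mask("The capital of France is [MASK]."):
    print(pred["token_str"], round(pred["score"], 3))  # top predictions for the masked token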

GPT SERIES (GENERATIVE PRE-TRAINED TRANSFORMER)

Autoregressive language modeling

Unidirectional (left-to-right)

Excellent for generation tasks

GPT-3: 175B parameters!
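GPT-3 itself is API-only, but the same autoregressive idea can be tried locally with GPT-2 through the text-generation pipeline (a rough sketch; output will vary run to run):

from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")
out = generator("Natural Language Processing is", max_new_tokens=20, num_return_sequences=1)
print(out[0]["generated_text"])  # the prompt continued left-to-right, one token at a time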

8. Common NLP Tasks

Classification Tasks:

Sentiment Analysis - Positive/Negative/Neutral

Spam Detection - Spam/Not Spam

Topic Classification - News categories

Intent Detection - User's intention


Sequence Labeling Tasks:

NER - Entity recognition

POS Tagging - Grammatical roles

Chunking - Phrase identification

Generation Tasks:

Machine Translation - Language A → Language B

Text Summarization - Long → Short

Question Answering - Context + Question → Answer

Dialogue Systems - Chatbots

9. Evaluation Metrics

FOR CLASSIFICATION:

Accuracy: Overall correctness

Precision/Recall/F1: For imbalanced datasets

ROC-AUC: For binary classification
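All three classification metrics are one import away in scikit-learn. A sketch with made-up labels and scores:

from sklearn.metrics import accuracy_score, precision_recall_fscore_support, roc_auc_score

y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]
y_score = [0.9, 0.2, 0.8, 0.4, 0.1, 0.7, 0.95, 0.3]  # predicted probabilities for ROC-AUC

print("Accuracy:", accuracy_score(y_true, y_pred))
precision, recall, f1, _ = precision_recall_fscore_support(y_true, y_pred, average="binary")
print("Precision/Recall/F1:", precision, recall, f1)
print("ROC-AUC:", roc_auc_score(y_true, y_score))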

FOR GENERATION:

BLEU: Machine translation quality

ROUGE: Summarization quality

Perplexity: Language model quality

Human Evaluation: Still the gold standard!

10. Python Libraries & Tools

Essential NLP Toolkit:

NLTK - Classic, educational, lots of resources


spaCy - Fast, production-ready, great for NER/POS

Gensim - Topic modeling, word embeddings

Transformers (HuggingFace) - State-of-the-art models

TextBlob - Simple API for common tasks

Stanford CoreNLP - Java-based, very comprehensive

11. Challenges in NLP

The Hard Problems:

Ambiguity: "I saw her duck" (verb or noun?)

Sarcasm/Irony: "Great, another meeting!" (not actually great)

Context: "Bank" (financial or river?)

Multilingual: Different languages, different structures

Domain Adaptation: Medical vs Legal vs Casual text

Bias: Models learn societal biases from data

12. Advanced Topics

ZERO-SHOT & FEW-SHOT LEARNING

Perform tasks without task-specific training data!

CROSS-LINGUAL MODELS

Models that work across multiple languages (mBERT, XLM-R)

PROMPT ENGINEERING

The art of crafting inputs to get desired outputs from LLMs

RETRIEVAL-AUGMENTED GENERATION (RAG)

Combine retrieval with generation for factual, up-to-date responses


13. Real-World Applications

Industry Applications

Healthcare Clinical notes analysis, drug discovery

Finance Sentiment analysis for trading, document processing

Legal Contract analysis, legal research

Customer Service Chatbots, ticket routing

Education Automated grading, personalized tutoring

14. Code Example - Sentiment Analysis

from transformers import pipeline

# Load pre-trained model
classifier = pipeline("sentiment-analysis")

# Analyze sentiment
texts = [
    "I love this product!",
    "This is terrible.",
    "It's okay, nothing special."
]

results = classifier(texts)
for text, result in zip(texts, results):
    print(f"{text} → {result['label']}: {result['score']:.3f}")

15. Future Directions

What's Next in NLP?

Multimodal models (text + image + audio)


More efficient models (smaller, faster)

Better reasoning capabilities

Improved factuality and reduced hallucinations

Personal AI assistants

Real-time translation breaking language barriers

16. Study Tips & Resources

My Learning Path:

1. Master regex and basic text processing

2. Understand traditional methods (BoW, TF-IDF)

3. Learn word embeddings thoroughly

4. Dive into transformers and attention

5. Practice with real datasets (Kaggle, papers)

6. Build end-to-end projects

RECOMMENDED PAPERS:

"Attention Is All You Need" (2017)

"BERT: Pre-training of Deep Bidirectional Transformers" (2018)

"Language Models are Few-Shot Learners" (GPT-3, 2020)

"Language is the foundation of human intelligence - teaching machines to understand it is teaching them to
think"
