NLP with Python using NLTK
What is NLP (Natural Language Processing)?
• NLP (Natural Language Processing) is a field at the intersection of
computer science, artificial intelligence (AI), and linguistics. It enables
computers to understand, interpret, and generate human language.
• You’ve used NLP if you've:
– Spoken to Alexa, Siri, or Google Assistant
– Typed something and used autocorrect or autocomplete
– Seen spam filters in your email
– Used chatbots or language translation tools
• Popular NLP libraries:
– NLTK (Natural Language Toolkit) – beginner-friendly
– spaCy – fast and industrial-strength
– TextBlob – simple, useful for sentiment analysis
– transformers (by HuggingFace) – for deep learning-based NLP (e.g., BERT, GPT)
What is NLTK?
• NLTK (Natural Language Toolkit) is a powerful Python library
used for working with human language data (text). It provides
easy-to-use tools and resources to process, analyze, and
understand natural language.
Text Preprocessing
• Install NLTK: pip install nltk
• import nltk
• nltk.download('punkt')
• nltk.download('stopwords')
• nltk.download('wordnet')
• Tokenization: Breaking text into words or sentences.
• Stopwords: Common words (like "the", "is") that are removed before
analysis.
import nltk
nltk.download('punkt')      # tokenizer models
nltk.download('stopwords')  # stopword lists
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
text = "NLP is fun and powerful!"
tokens = word_tokenize(text)  # ['NLP', 'is', 'fun', 'and', 'powerful', '!']
# Drop common stopwords such as "is" and "and"
filtered = [w for w in tokens if w.lower() not in stopwords.words('english')]
print(filtered)  # ['NLP', 'fun', 'powerful', '!']
This removes unimportant words so that your analysis focuses on meaningful
content.
Tokenization & Stopwords
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
text = "NLTK is a powerful Python library for NLP."
tokens = word_tokenize(text)
filtered = [word for word in tokens if word.lower() not in stopwords.words('english')]
print(filtered)
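Running this prints ['NLTK', 'powerful', 'Python', 'library', 'NLP', '.']: the stopwords
"is", "a", and "for" are gone, but the punctuation token '.' survives, since stopword
filtering does not remove punctuation.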
Stemming & Lemmatization
• Stemming: Strips suffixes ("playing" → "play").
• Lemmatization: Reduces to dictionary form ("better" → "good").
• Both are used to normalize text. Lemmatization is more accurate but
slower.
import nltk
nltk.download('wordnet')
from nltk.stem import PorterStemmer, WordNetLemmatizer
stemmer = PorterStemmer()
print(stemmer.stem("playing"))  # play
lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("playing", pos='v'))  # play
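The difference shows up on irregular forms; a quick check, reusing the objects above:
print(stemmer.stem("better"))                   # better (no suffix rule applies)
print(lemmatizer.lemmatize("better", pos='a'))  # good (looked up in WordNet)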
POS Tagging & Named Entity Recognition
• POS Tagging: Labels each word (noun, verb, etc.)
• NER: Detects entities like names and places.
import nltk
from nltk.tokenize import word_tokenize
nltk.download('averaged_perceptron_tagger')
nltk.download('maxent_ne_chunker')
nltk.download('words')
sentence = "Steve Jobs founded Apple in California."
tokens = word_tokenize(sentence)
tags = nltk.pos_tag(tokens)     # [('Steve', 'NNP'), ('Jobs', 'NNP'), ('founded', 'VBD'), ...]
ner_tree = nltk.ne_chunk(tags)  # groups tagged tokens into named-entity chunks
print(ner_tree)
• This helps in identifying structure and important entities in a
sentence.
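The printed tree marks each detected entity; the output looks roughly like the following,
though chunk labels can vary between NLTK versions:
(S (PERSON Steve/NNP) (PERSON Jobs/NNP) founded/VBD (GPE Apple/NNP) in/IN (GPE California/NNP) ./.)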
Text Classification (Naive Bayes)
• Text Classification: Predicts labels for input text (e.g.,
sentiment).
• Naive Bayes: A simple probabilistic classifier.
from nltk.classify import NaiveBayesClassifier
from nltk.tokenize import word_tokenize

def format_sentence(sent):
    # Bag-of-words features: every word in the sentence maps to True
    return {word: True for word in word_tokenize(sent.lower())}

train = [(format_sentence("I love this movie"), 'pos'),
         (format_sentence("I hate this product"), 'neg')]
classifier = NaiveBayesClassifier.train(train)
print(classifier.classify(format_sentence("love product")))
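Two training sentences make this a toy example; a real classifier needs far more labeled
data. NLTK can also report which word features carried the most weight:
classifier.show_most_informative_features()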
TF-IDF (Term Frequency-Inverse Document Frequency)
• TF-IDF is a statistical measure used to evaluate how important
a word is in a document relative to a collection of documents
(called a corpus).
• Formula:
• TF-IDF(t, d) = TF(t, d) × IDF(t)
• TF (Term Frequency): How often term t appears in document d
– TF(t, d) = (No. of times t appears in d) / (Total terms in d)
• IDF (Inverse Document Frequency): How rare the term is across all
documents
– IDF(t) = log(Total number of documents / Number of documents containing t)
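A quick worked example (using the natural log; implementations differ in log base and
smoothing): if a term appears 3 times in a 100-word document, TF = 3/100 = 0.03. If it
occurs in 2 of the corpus's 10 documents, IDF = log(10/2) ≈ 1.61, so
TF-IDF ≈ 0.03 × 1.61 ≈ 0.048.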
Why Use TF-IDF?
• Words like “the”, “is”, “and” appear in all documents and carry little
meaning.
• TF-IDF downweights common words and upweights rare, important
ones.
• Example:
• If the word “excellent” appears 3 times in a review but rarely in other
reviews, it will get a high TF-IDF score, showing it's significant for that
specific document.
from sklearn.feature_extraction.text import TfidfVectorizer
docs = ["I love NLP", "NLP is fun and useful", "I love machine learning"]
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(docs)
print(tfidf_matrix.toarray())              # one row per document, one column per term
print(vectorizer.get_feature_names_out())  # the vocabulary, in column order
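The vocabulary comes out alphabetically: ['and' 'fun' 'is' 'learning' 'love' 'machine'
'nlp' 'useful']. Note that scikit-learn's default tokenizer drops single-character
tokens, so "I" never enters the vocabulary.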
Word2Vec
• Word2Vec is a technique to convert words into vectors
(numbers) so that a machine can understand their meaning
based on context. It’s used in NLP for tasks like similarity
detection, text classification, and more.
• Word2Vec trains a shallow neural network to learn word
embeddings using one of two models:
– CBOW (Continuous Bag of Words) – Predicts a word from its
surrounding context.
– Skip-Gram – Predicts context from the target word (works better with
small data).
– Words with similar meanings end up having similar vectors.
Install Required Library
• pip install gensim
from gensim.models import Word2Vec
# Example corpus
sentences = [
["i", "love", "nlp"],
["nlp", "is", "fun"],
["i", "enjoy", "machine", "learning"]
]
# Train Word2Vec model
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1) # sg=1 uses skip-gram
# Get vector for a word
print("Vector for 'nlp':")
print(model.wv['nlp'])
# Find similar words
print("\nWords similar to 'nlp':")
print(model.wv.most_similar('nlp'))
Explanation of Parameters
• vector_size: Dimension of word embeddings (usually 50–300)
• window: Context window size (how many words to the left/right
to consider)
• min_count: Ignores words that appear fewer than this many times
• sg: 1 for skip-gram, 0 for CBOW
• After training on a large corpus, model.wv['king'] -
model.wv['man'] + model.wv['woman'] gives a
vector close to 'queen' (see the sketch after this list).
• Why Use Word2Vec?
– Captures semantic relationships between words.
– Great for text classification, sentiment analysis,
chatbot development, etc.
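The analogy needs a model trained on a large corpus; the three-sentence corpus above is
far too small. A minimal sketch using gensim's downloadable pretrained Google News
vectors (roughly a 1.6 GB download):
import gensim.downloader as api
wv = api.load('word2vec-google-news-300')  # pretrained KeyedVectors
# king - man + woman ≈ queen
print(wv.most_similar(positive=['king', 'woman'], negative=['man'], topn=1))
# [('queen', 0.71...)]  (approximate similarity score)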
Sentiment Analysis
• Sentiment Analysis is the process of
identifying and classifying emotions or
opinions in text — typically as:
– Positive
– Negative
– Neutral
• It's widely used in:
– Product reviews
– Social media monitoring
– Customer feedback analysis
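A minimal sketch using NLTK's built-in VADER analyzer, a rule-based sentiment model well
suited to short, informal text:
import nltk
nltk.download('vader_lexicon')
from nltk.sentiment import SentimentIntensityAnalyzer
sia = SentimentIntensityAnalyzer()
print(sia.polarity_scores("I love this product!"))
# e.g. {'neg': 0.0, 'neu': 0.31, 'pos': 0.69, 'compound': 0.67}
# By convention, compound > 0.05 reads as positive and < -0.05 as negative.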