NLP Laboratory
Module 1
1) Tokenizing - Design a Python program to split up a larger body of text into smaller lines or words, even for text in a non-English language.
import nltk
nltk.download('punkt') # Download necessary NLTK data (if not already downloaded)
from nltk.tokenize import sent_tokenize, word_tokenize
# French text
text = "Bonjour le monde! Ceci est un texte simple. "
# Tokenize sentences
sentences = sent_tokenize(text)
print("Sentences:", sentences)
# Tokenize words
words = word_tokenize(text, language='french')
print("Words:", words)
Output: Sentences: ['Bonjour le monde!', 'Ceci est un texte simple.']
Words: ['Bonjour', 'le', 'monde', '!', 'Ceci', 'est', 'un', 'texte', 'simple', '.']
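Note: sent_tokenize also accepts a language argument, so sentence splitting can be made language-aware in the same way as word tokenization; a minimal sketch reusing the French text above:
# Sentence tokenization with the French Punkt model
sentences_fr = sent_tokenize(text, language='french')
print("Sentences (French):", sentences_fr)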
2) Corpus - Design a Python program to illustrate working with a corpus.
import nltk
from nltk.corpus import brown
nltk.download('brown')
# Load the Brown corpus as a flat list of words
brown_corpus = brown.words()
# Count occurrences of a specific word
word_count = brown_corpus.count('the')
print("\nOccurrences of 'the' in the brown corpus", word_count)
# Find common collocations (word pairs that frequently occur together)
collocations = nltk.Text(brown_corpus).collocation_list()
print("\nCollocations in the brown corpus.")
print(collocations[:10])
# Build a frequency distribution over all words in the corpus
fdist = nltk.FreqDist(brown_corpus)
print("\nMost common words in the brown corpus.")
print(fdist.most_common(10))
Output: Occurrences of 'the' in the brown corpus 62713
Collocations in the brown corpus.
[('United', 'States'), ('New', 'York'), ('per', 'cent'), ('Rhode', 'Island'), ('years', 'ago'), ('Los',
'Angeles'), ('White', 'House'), ('Peace', 'Corps'), ('World', 'War'), ('San', 'Francisco')]
Most common words in the brown corpus.
[('the', 62713), (',', 58334), ('.', 49346), ('of', 36080), ('and', 27915), ('to', 25732), ('a',
21881), ('in', 19536), ('that', 10237), ('is', 10011)]
3) Lemmatizing- Design a Python program to group together the different inflected
forms of a word so they can be analyzed as a single item.
import nltk
nltk.download('punkt')
nltk.download('wordnet') # Required by WordNetLemmatizer
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
# Sample text
text = "The cats are chasing mice in the garden. "
# Tokenize the text
words = word_tokenize(text)
# Initialize the WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
# Lemmatize each word
lemmatized_words = [lemmatizer.lemmatize(word) for word in words]
# Print original and lemmatized words
print("Original words:", words)
print("Lemmatized words:", lemmatized_words)
Output: Original words: ['The', 'cats', 'are', 'chasing', 'mice', 'in', 'the', 'garden', '.']
Lemmatized words: ['The', 'cat', 'are', 'chasing', 'mouse', 'in', 'the', 'garden', '.']
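Note: WordNetLemmatizer treats every word as a noun unless told otherwise, which is why 'are' and 'chasing' are unchanged above. Passing a part-of-speech hint usually improves the result; a minimal sketch:
# Lemmatize as verbs by passing pos='v' (WordNet's verb tag)
print(lemmatizer.lemmatize('chasing', pos='v')) # chase
print(lemmatizer.lemmatize('are', pos='v')) # be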
4) Process - Implement a Python program to preprocess the given text.
import string
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
nltk.download('punkt')
nltk.download('stopwords')
text = "Hello! This is a sample text. It includes punctuation marks, like commas,
periods, and exclamation marks!"
# Lowercasing and removing punctuation
text = text.lower()
text = ''.join(char for char in text if char not in string.punctuation)
# Tokenization
tokens = word_tokenize(text)
# Removing stop words
stop_words = set(stopwords.words('english'))
tokens = [token for token in tokens if token not in stop_words]
print("Processed text:", tokens)
Output: Processed text: ['hello', 'sample', 'text', 'includes', 'punctuation', 'marks', 'like',
'commas', 'periods', 'exclamation', 'marks']
Module 2
1) Getting text to analyze - Design a Python program to analyze the sentiment of the given text.
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer
nltk.download('vader_lexicon') # Required by SentimentIntensityAnalyzer
text = "I absolutely love this movie! the acting is fantastic and the storyline is captivating"
sid = SentimentIntensityAnalyzer()
sentiment_scores = sid.polarity_scores(text)
# Classify using the conventional compound-score thresholds
if sentiment_scores['compound'] >= 0.05:
    sentiment = "Positive"
elif sentiment_scores['compound'] <= -0.05:
    sentiment = "Negative"
else:
    sentiment = "Neutral"
print("Text: ", text)
print("Sentiment: ", sentiment)
print("Sentiment scores: ", sentiment_scores)
Output: Text: I absolutely love this movie! the acting is fantastic and the storyline is
captivating
Sentiment: Positive
Sentiment scores: {'neg': 0.0, 'neu': 0.567, 'pos': 0.433, 'compound': 0.855}
2) POS Tagger - Design a Python program to perform part-of-speech tagging on text scraped from a website.
import requests
import nltk
from bs4 import BeautifulSoup
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger') # Required by nltk.pos_tag
url = "https://www.snickers.com/" # Replace with the actual website URL
response = requests.get(url)
soup = BeautifulSoup(response.content, "html.parser")
text = soup.get_text() # Extract text from HTML
# Tokenize the text
tokens = nltk.word_tokenize(text)
# Perform part-of-speech tagging
tagged_tokens = nltk.pos_tag(tokens)
print(tagged_tokens)
Output: [('Just', 'RB'), ('a', 'DT'), ('moment', 'NN'), ('...', ':'), ('Enable', 'JJ'), ('JavaScript',
'NNP'), ('and', 'CC'), ('cookies', 'NNS'), ('to', 'TO'), ('continue', 'VB')]
3) Default Tagger - Design a Python program to illustrate the default tagger.
import nltk
from nltk.tokenize import word_tokenize
from nltk.tag import DefaultTagger
text = "This is an example sentence for illustrating default tagger"
words = word_tokenize(text)
# DefaultTagger assigns the same tag ('NN', noun) to every token
default_tagger = DefaultTagger('NN')
tagged_words = default_tagger.tag(words)
print("Tagged words: ")
for word, tag in tagged_words:
    print(f"{word} : {tag}")
Output: Tagged words:
This : NN
is : NN
an : NN
example : NN
sentence : NN
for : NN
illustrating : NN
default : NN
tagger : NN
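A DefaultTagger is mostly useful as the final fallback in a backoff chain, where a trained tagger hands unknown words to it. A small sketch (the choice of the tagged Brown news sentences as training data is only an illustration):
import nltk
from nltk.corpus import brown
from nltk.tokenize import word_tokenize
from nltk.tag import UnigramTagger, DefaultTagger
nltk.download('brown')
train_sents = brown.tagged_sents(categories='news')
# Unigram tagger that falls back to 'NN' for words it has never seen
tagger = UnigramTagger(train_sents, backoff=DefaultTagger('NN'))
print(tagger.tag(word_tokenize("This is an example sentence")))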
4) Chunking - Design a Python program to group similar words together based on the part of speech of the word.
import nltk
from nltk import word_tokenize, pos_tag
def pos_tagging(text):
    tokens = word_tokenize(text)
    tagged_words = nltk.pos_tag(tokens)
    return tagged_words
def group_similar_words(tagged_words):
    grouped_words = {}
    for word, tag in tagged_words:
        if tag not in grouped_words:
            grouped_words[tag] = []
        grouped_words[tag].append(word)
    return grouped_words
text = "The cat is chasing the mouse. A dog is barking loudly."
tagged_words = pos_tagging(text)
grouped_words = group_similar_words(tagged_words)
for tag, words in grouped_words.items():
    print(f"POS Tag: {tag}, Words: {words} ")
Output: POS Tag: DT, Words: ['The', 'the', 'A']
POS Tag: NN, Words: ['cat', 'mouse', 'dog']
POS Tag: VBZ, Words: ['is', 'is']
POS Tag: VBG, Words: ['chasing', 'barking']
POS Tag: ., Words: ['.', '.']
POS Tag: RB, Words: ['loudly']
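Note: in NLTK, chunking more commonly means grouping tagged tokens into phrases with a chunk grammar; a minimal sketch using RegexpParser and the pos_tagging helper defined above (the noun-phrase grammar below is only an illustrative assumption):
import nltk
grammar = "NP: {<DT>?<JJ>*<NN>}" # optional determiner, any adjectives, then a noun
chunk_parser = nltk.RegexpParser(grammar)
tree = chunk_parser.parse(pos_tagging("The cat is chasing the mouse."))
print(tree)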
5) Chinking- Design a Python program to remove a sequence of tokens from a chunk.
import nltk
def chink_text(text, chink_pattern):
    tokens = nltk.word_tokenize(text)
    # POS-tag the token sequence that should be removed (the "chink")
    chinked_tokens = [(word, tag) for (word, tag) in nltk.pos_tag(nltk.word_tokenize(chink_pattern))]
    tagged_tokens = nltk.pos_tag(tokens)
    # Keep only the (word, tag) pairs that are not part of the chink
    cleaned_tokens = [token for token in tagged_tokens if token not in chinked_tokens]
    cleaned_text = " ".join([word for word, _ in cleaned_tokens])
    return cleaned_text
if __name__ == "__main__":
    text = "The quick brown fox jumps over the lazy dog"
    chink_pattern = "quick brown fox"
    cleaned_text = chink_text(text, chink_pattern)
    print("Cleaned Text: ")
    print(cleaned_text)
Output: Cleaned Text:
The jumps over the lazy dog
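NLTK's RegexpParser also supports chinking directly: a }...{ rule removes a tag sequence from chunks built by a {...} rule. A small sketch following that chunk-then-chink pattern (the particular grammar is an illustrative assumption):
import nltk
grammar = r"""
Chunk:
{<.*>+} # chunk everything
}<VBZ|IN>+{ # chink (remove) verbs and prepositions from the chunks
"""
tagged = nltk.pos_tag(nltk.word_tokenize("The quick brown fox jumps over the lazy dog"))
print(nltk.RegexpParser(grammar).parse(tagged))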
Module 3
1) N-Grams - Implement a Python program to generate n-grams.
def generate_ngrams(text, n):
    words = text.split()
    ngrams = []
    # Slide a window of size n over the word list
    for i in range(len(words) - n + 1):
        ngrams.append(words[i:i + n])
    return ngrams
text = "This is a sample text for generating n-grams"
n=3
result = generate_ngrams(text,n)
print(result)
Output: [['This', 'is', 'a'], ['is', 'a', 'sample'], ['a', 'sample', 'text'], ['sample', 'text', 'for'],
['text', 'for', 'generating'], ['for', 'generating', 'n-grams']]
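NLTK also provides a ready-made helper for this; a minimal sketch (note that it yields tuples rather than lists):
from nltk.util import ngrams
text = "This is a sample text for generating n-grams"
print(list(ngrams(text.split(), 3)))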
2) Smoothing - Design a Python program to perform smoothing of word counts using add-one (Laplace) smoothing.
def laplace_smoothing(word_counts, vocab_size):
    smoothed_counts = {}
    total_words = sum(word_counts.values())
    for word, count in word_counts.items():
        smoothed_counts[word] = (count + 1) / (total_words + vocab_size)
    return smoothed_counts
word_counts = {'apple': 3, 'banana': 2, 'orange': 1}
vocab_size = 100
smoothed_count = laplace_smoothing(word_counts, vocab_size)
print('Original count', word_counts)
print("Smooth count", smoothed_count)
Output: Original count {'apple': 3, 'banana': 2, 'orange': 1}
Smooth count {'apple': 0.03773584905660377, 'banana': 0.02830188679245283,
'orange': 0.018867924528301886}
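Each smoothed value above is simply the add-one estimate (count + 1) / (total_words + vocab_size); a quick check for 'apple':
# total_words = 3 + 2 + 1 = 6, vocab_size = 100
print((3 + 1) / (6 + 100)) # 0.03773584905660377, matching the output above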
3) Good-Turing - Develop a Python program to calculate Good-Turing frequencies.
from collections import Counter
def good_turing(frequencies):
    freq_of_freq = Counter(frequencies)
    good_turing_frequencies = {}
    for freq, freq_count in freq_of_freq.items():
        if freq + 1 in freq_of_freq:
            good_turing_frequencies[freq] = (freq + 1) * (freq_of_freq[freq + 1] / freq_count)
        else:
            good_turing_frequencies[freq] = freq_count / len(frequencies)
    return good_turing_frequencies
text = "The quick brown fox jumps over the lazy dog"
word_lengths = [len(word) for word in text.split()]
word_length_counts = Counter(word_lengths)
good_turing_frequencies = good_turing(word_lengths)
print("Word Length\tFrequency\tGood-turing Frequency")
for length, freq in word_length_counts.items():
    gt_freq = good_turing_frequencies[length]
    print(f"{length}\t\t{freq}\t\t{gt_freq}")
Output:
Word Length Frequency Good-turing Frequency
3 4 2.0
5 3 0.3333333333333333
4 2 7.5
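The adjusted values follow the Good-Turing formula c* = (c + 1) * N(c+1) / N(c), with word lengths standing in for counts here and a simple fallback of N(c) / N when no N(c+1) exists; a quick check for length 3:
# Length 3 occurs 4 times (N(3) = 4) and length 4 occurs 2 times (N(4) = 2)
print((3 + 1) * 2 / 4) # 2.0, matching the table above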
Module 4
1) Lexical Semantics - Design a Python program to perform text classification.
# Import necessary libraries
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
# Load the 20 newsgroups dataset
newsgroups = fetch_20newsgroups(subset='all')
categories = newsgroups.target_names
# Split the dataset into training and test sets
X_train, X_test, y_train, y_test = train_test_split(newsgroups.data, newsgroups.target,
test_size=0.3, random_state=42)
# Initialize the TF-IDF Vectorizer and transform the data
vectorizer = TfidfVectorizer(stop_words='english', max_df=0.7)
X_train_tfidf = vectorizer.fit_transform(X_train)
X_test_tfidf = vectorizer.transform(X_test)
# Train the Logistic Regression classifier
classifier = LogisticRegression(max_iter=1000, random_state=42)
classifier.fit(X_train_tfidf, y_train)
# Predict the labels for the test set and print accuracy
y_pred = classifier.predict(X_test_tfidf)
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy * 100:.2f}%")
# Function to classify new text data
def classify_text(text):
    text_tfidf = vectorizer.transform([text])
    prediction = classifier.predict(text_tfidf)
    return categories[prediction[0]]
# Example usage
new_text = "NASA launches a new space mission."
predicted_category = classify_text(new_text)
print(f"The predicted category for the new text is: {predicted_category}")
Output: Accuracy: 90.13%
The predicted category for the new text is: sci.space
2) Meaning Representation- Implement a Python program to represent the meaning
of the given text.
import nltk
def represent_meaning(text):
    """
    This function takes a sentence and returns a simple dictionary
    representing its meaning based on part-of-speech tags.
    """
    # Download nltk resources if not already installed
    nltk.download('punkt')
    nltk.download('averaged_perceptron_tagger')
    # Tokenize the sentence
    tokens = nltk.word_tokenize(text)
    # Get part-of-speech tags for each token
    tags = nltk.pos_tag(tokens)
    # Create a dictionary to represent meaning
    meaning = {}
    for token, tag in tags:
        if tag.startswith('VB'):  # Verbs
            meaning['action'] = token
        elif tag.startswith('NN'):  # Nouns
            if 'subject' not in meaning:
                meaning['subject'] = token
            else:
                meaning['object'] = token
        elif tag.startswith('JJ'):  # Adjectives
            meaning['adjective'] = token
    return meaning
# Example usage
sentence = "The cat chased the mouse."
meaning = represent_meaning(sentence)
print(sentence)
print(meaning)
Output: The cat chased the mouse.
{'subject': 'cat', 'action': 'chased', 'object': 'mouse'}
3) Disambiguation - Design the Lesk algorithm in Python to handle word sense disambiguation.
# Import necessary libraries
import nltk
from nltk.corpus import wordnet as wn
from nltk.tokenize import word_tokenize
nltk.download('wordnet')
nltk.download('omw-1.4')
nltk.download('punkt')
# Function to compute overlap between context and gloss
def compute_overlap(context, gloss):
    context = set(context)
    gloss = set(gloss)
    return len(context & gloss)
# Lesk algorithm for word sense disambiguation
def lesk_algorithm(word, sentence):
    # Tokenize the sentence and get the context
    context = word_tokenize(sentence)
    # Get all synsets (senses) of the word
    synsets = wn.synsets(word)
    if not synsets:
        return None
    # Initialize variables to keep track of the best sense and max overlap
    best_sense = synsets[0]
    max_overlap = 0
    for sense in synsets:
        # Get the gloss of the current sense and tokenize it
        gloss = word_tokenize(sense.definition())
        # Compute the overlap between context and gloss
        overlap = compute_overlap(context, gloss)
        # Update best_sense if the current sense has more overlap
        if overlap > max_overlap:
            max_overlap = overlap
            best_sense = sense
    return best_sense
# Example usage
sentence = "I went to the bank to deposit my money."
word = "money"
sense = lesk_algorithm(word, sentence)
if sense:
    print(f"The best sense for '{word}' in the sentence is: {sense.name()}")
    print(f"Definition: {sense.definition()}")
else:
    print(f"No senses found for the word '{word}'.")
Output: The best sense for 'money' in the sentence is: money.n.03
Definition: the official currency issued by a government or national bank
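NLTK also ships a simplified Lesk implementation in nltk.wsd; a minimal sketch (it may pick a different sense than the version above, since its overlap handling differs):
from nltk.wsd import lesk
from nltk.tokenize import word_tokenize
sense = lesk(word_tokenize("I went to the bank to deposit my money."), "money")
if sense:
    print(sense.name(), "-", sense.definition())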
Module 5
1) Information Extraction - Design a Python program to extract structured information from unstructured text.
import spacy
import email
from email.policy import default
# Check if spaCy model is installed and download if necessary
try:
    nlp = spacy.load("en_core_web_sm")
except OSError:
    import subprocess
    import sys
    subprocess.check_call([sys.executable, '-m', 'spacy', 'download', 'en_core_web_sm'])
    nlp = spacy.load("en_core_web_sm")
def named_entity_recognition(text):
    """
    Perform Named Entity Recognition (NER) on the given text using spaCy.
    Args:
        text (str): The text to process.
    Returns:
        list: A list of tuples containing entities and their labels.
    """
    # Process the text
    doc = nlp(text)
    # Extract named entities
    entities = [(ent.text, ent.label_) for ent in doc.ents]
    return entities
# Sample email content (addresses are redacted placeholders)
email_content = """From: [email protected]
To: [email protected]
Subject: Sample Email

This is a sample email for testing information extraction.
"""
# Parse the email content
msg = email.message_from_string(email_content, policy=default)
# Extract and print email fields
from_address = msg['From']
to_address = msg['To']
subject = msg['Subject']
body = msg.get_body(preferencelist=('plain',)).get_content()  # note: ('plain',) is a one-element tuple
print("Named Entity Recognition with spaCy:")
entities = named_entity_recognition(body)
for entity, label in entities:
    print(f"Entity: {entity}, Label: {label}")
print("\nExtracting Information from Emails:")
print(f"From: {from_address}")
print(f"To: {to_address}")
print(f"Subject: {subject}")
print(f"Body:\n{body}")
Output: Named Entity Recognition with spaCy:
Extracting Information from Emails:
From: [email protected]
To: [email protected]
Subject: Sample Email
Body:
This is a sample email for testing information extraction.
2) Filtering Stop Words - Implement a Python program to filter stop words.
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
# Download NLTK stopwords and tokenizer data
nltk.download('stopwords')
nltk.download('punkt')
# Sample text
text = "NLTK is a leading platform for building Python programs to work with human language data."
# Tokenize the text
tokens = word_tokenize(text)
# Get English stopwords from NLTK
stop_words = set(stopwords.words('english'))
# Filter out stopwords
filtered_tokens = [word for word in tokens if word.lower() not in stop_words]
# Print filtered tokens
print("Original Text:")
print(text)
print("\nFiltered Text (without stopwords):")
print(" ".join(filtered_tokens))
Output:
Original Text:
NLTK is a leading platform for building Python programs to work with human language
data.
Filtered Text (without stopwords):
NLTK leading platform building Python programs work human language data .
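The stray '.' remains because only stopwords are filtered; a minimal extension (an assumption, reusing the punctuation filter from the preprocessing program in Module 1) also drops punctuation tokens:
import string
filtered_tokens = [word for word in filtered_tokens if word not in string.punctuation]
print(" ".join(filtered_tokens))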
3) Stemming- Design a Python program to reduce an inflected word down to its word
stem.
import nltk
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize
def stem_text(text):
    # Initialize Porter Stemmer
    stemmer = PorterStemmer()
    # Tokenize the text into words
    words = word_tokenize(text)
    # Stem each word in the text
    stemmed_words = [stemmer.stem(word) for word in words]
    # Join stemmed words back into sentence
    stemmed_text = ' '.join(stemmed_words)
    return stemmed_text
# Example usage:
text = "Stemming is used to reduce words down to their word stem."
stemmed_text = stem_text(text)
print("Original Text:")
print(text)
print("\nText after stemming:")
print(stemmed_text)
Output:
Original Text:
Stemming is used to reduce words down to their word stem.
Text after stemming:
stem is use to reduc word down to their word stem .
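For comparison, NLTK's SnowballStemmer (the "Porter2" stemmer, also available for several other languages) can be swapped in with one line; a small sketch reusing the text and word_tokenize from the program above:
from nltk.stem import SnowballStemmer
# English Snowball stemmer; pass e.g. 'french' for other languages
snowball = SnowballStemmer('english')
print(' '.join(snowball.stem(word) for word in word_tokenize(text)))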
4) Question Answering System - Design a question answering system using Python.
import nltk
nltk.download('punkt') # Required by nltk.word_tokenize
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
# Example corpus of texts
corpus = [
    "Albert Einstein was a German-born theoretical physicist who developed the theory of relativity.",
    "The Mona Lisa is a half-length portrait painting by the Italian artist Leonardo da Vinci.",
    "Python is an interpreted, high-level, general-purpose programming language.",
    "Mount Everest is the highest mountain in the world, located in Nepal."
]
# Preprocess texts
def preprocess_text(text):
    tokens = nltk.word_tokenize(text.lower())
    return ' '.join(tokens)
processed_corpus = [preprocess_text(text) for text in corpus]
# Vectorize texts using TF-IDF
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(processed_corpus)
# Function to answer questions
def answer_question(question):
    question_vec = vectorizer.transform([preprocess_text(question)])
    similarities = cosine_similarity(question_vec, X)
    idx = similarities.argmax()
    return corpus[idx]
# Example questions
questions = [
"Who developed the theory of relativity?",
"What is the Mona Lisa?",
"What is Python used for?",
"Where is Mount Everest located?"
]
# Answering each question
for question in questions:
    print("Question:", question)
    print("Answer:", answer_question(question))
    print()
Output:
Question: Who developed the theory of relativity?
Answer: Albert Einstein was a German-born theoretical physicist who developed the
theory of relativity.
Question: What is the Mona Lisa?
Answer: The Mona Lisa is a half-length portrait painting by the Italian artist Leonardo
da Vinci.
Question: What is Python used for?
Answer: Python is an interpreted, high-level, general-purpose programming language.
Question: Where is Mount Everest located?
Answer: Mount Everest is the highest mountain in the world, located in Nepal.