
NLP Laboratory

Module 1

1) Tokenizing - Design a Python program to split a larger body of text into smaller lines or words, including text in a non-English language.

import nltk
nltk.download('punkt')  # Download necessary NLTK data (if not already downloaded)

from nltk.tokenize import sent_tokenize, word_tokenize

# French text
text = "Bonjour le monde! Ceci est un texte simple."

# Tokenize sentences
sentences = sent_tokenize(text)
print("Sentences:", sentences)

# Tokenize words
words = word_tokenize(text, language='french')
print("Words:", words)

Output:
Sentences: ['Bonjour le monde!', 'Ceci est un texte simple.']
Words: ['Bonjour', 'le', 'monde', '!', 'Ceci', 'est', 'un', 'texte', 'simple', '.']
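
Note: sent_tokenize also accepts a language parameter; a minimal sketch (assuming the 'punkt' data downloaded above) that uses the pre-trained French Punkt model for sentence splitting:

from nltk.tokenize import sent_tokenize

text = "Bonjour le monde! Ceci est un texte simple."

# Use the French Punkt sentence model instead of the English default
sentences_fr = sent_tokenize(text, language='french')
print(sentences_fr)  # ['Bonjour le monde!', 'Ceci est un texte simple.']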

2) Corpus - Design a Python program to illustrate working with a corpus.


import nltk
from nltk.corpus import brown
nltk.download('brown')

# Words of the Brown corpus
brown_corpus = brown.words()

# Count occurrences of the word 'the'
word_count = brown_corpus.count('the')
print("\nOccurrences of 'the' in the brown corpus", word_count)

# Frequently co-occurring word pairs (collocations)
collocations = nltk.Text(brown_corpus).collocation_list()
print("\nCollocations in the brown corpus.")
print(collocations[:10])

# Frequency distribution of all words
fdist = nltk.FreqDist(brown_corpus)
print("\nMost common words in the brown corpus.")
print(fdist.most_common(10))
Output: Occurrences of 'the' in the brown corpus 62713
Collocations in the brown corpus.
[('United', 'States'), ('New', 'York'), ('per', 'cent'), ('Rhode', 'Island'), ('years', 'ago'), ('Los',
'Angeles'), ('White', 'House'), ('Peace', 'Corps'), ('World', 'War'), ('San', 'Francisco')]
Most common words in the brown corpus.
[('the', 62713), (',', 58334), ('.', 49346), ('of', 36080), ('and', 27915), ('to', 25732), ('a',
21881), ('in', 19536), ('that', 10237), ('is', 10011)]

3) Lemmatizing - Design a Python program to group together the different inflected forms of a word so they can be analyzed as a single item.

import nltk
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

nltk.download('punkt')
nltk.download('wordnet')  # Required for the WordNet lemmatizer

# Sample text
text = "The cats are chasing mice in the garden."

# Tokenize the text
words = word_tokenize(text)

# Initialize the WordNetLemmatizer
lemmatizer = WordNetLemmatizer()

# Lemmatize each word (the default part of speech is noun)
lemmatized_words = [lemmatizer.lemmatize(word) for word in words]

# Print original and lemmatized words
print("Original words:", words)
print("Lemmatized words:", lemmatized_words)

Output: Original words: ['The', 'cats', 'are', 'chasing', 'mice', 'in', 'the', 'garden', '.']
Lemmatized words: ['The', 'cat', 'are', 'chasing', 'mouse', 'in', 'the', 'garden', '.']
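
Note that 'are' and 'chasing' are unchanged because lemmatize() treats every token as a noun by default; a minimal sketch (assuming the 'wordnet' data downloaded above) of passing the part of speech explicitly:

from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

# pos='v' tells WordNet to treat the word as a verb
print(lemmatizer.lemmatize('are', pos='v'))      # be
print(lemmatizer.lemmatize('chasing', pos='v'))  # chase
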
4) Process - Implement a Python program to process the given text.

import string
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

nltk.download('punkt')
nltk.download('stopwords')

text = "Hello! This is a sample text. It includes punctuation marks, like commas, periods, and exclamation marks!"

# Lowercasing and removing punctuation
text = text.lower()
text = ''.join(char for char in text if char not in string.punctuation)

# Tokenization
tokens = word_tokenize(text)

# Removing stop words
stop_words = set(stopwords.words('english'))
tokens = [token for token in tokens if token not in stop_words]

print("Processed text:", tokens)


Output: Processed text: ['hello', 'sample', 'text', 'includes', 'punctuation', 'marks', 'like',
'commas', 'periods', 'exclamation', 'marks']

Module 2
1) Getting text to analyze - Design a Python program to analyze the sentiment of the given text.

import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer

nltk.download('vader_lexicon')  # Required for the VADER sentiment analyzer

text = "I absolutely love this movie! the acting is fantastic and the storyline is captivating"

sid = SentimentIntensityAnalyzer()
sentiment_scores = sid.polarity_scores(text)

# Classify the overall sentiment from the compound score
if sentiment_scores['compound'] >= 0.05:
    sentiment = "Positive"
elif sentiment_scores['compound'] <= -0.05:
    sentiment = "Negative"
else:
    sentiment = "Neutral"

print("Text: ", text)
print("Sentiment: ", sentiment)
print("Sentiment scores: ", sentiment_scores)
Output: Text: I absolutely love this movie! the acting is fantastic and the storyline is
captivating
Sentiment: Positive
Sentiment scores: {'neg': 0.0, 'neu': 0.567, 'pos': 0.433, 'compound': 0.855}

2) POS Tagger - Design a Python program to perform part-of-speech tagging on text scraped from a website.
import nltk
import requests
from bs4 import BeautifulSoup

nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

url = "https://www.snickers.com/"  # Replace with the actual website URL
response = requests.get(url)
soup = BeautifulSoup(response.content, "html.parser")
text = soup.get_text()  # Extract text from HTML

# Tokenize the text
tokens = nltk.word_tokenize(text)

# Perform part-of-speech tagging
tagged_tokens = nltk.pos_tag(tokens)

print(tagged_tokens)
Output: [('Just', 'RB'), ('a', 'DT'), ('moment', 'NN'), ('...', ':'), ('Enable', 'JJ'), ('JavaScript',
'NNP'), ('and', 'CC'), ('cookies', 'NNS'), ('to', 'TO'), ('continue', 'VB')]

3) Default Tagger - Design a Python program to illustrate a default tagger.


import nltk
from nltk.tokenize import word_tokenize
from nltk.tag import DefaultTagger

nltk.download('punkt')

text = "This is an example sentence for illustrating default tagger"
words = word_tokenize(text)

# Tag every token with the same default tag 'NN'
default_tagger = DefaultTagger('NN')
tagged_words = default_tagger.tag(words)

print("Tagged words: ")
for word, tag in tagged_words:
    print(f"{word} : {tag}")
Output: Tagged words:
This : NN
is : NN
an : NN
example : NN
sentence : NN
for : NN
illustrating : NN
default : NN
tagger : NN
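
A default tagger is normally used as a backoff for a trained tagger rather than on its own; a minimal sketch (assuming the Brown corpus data is downloaded) of a UnigramTagger that falls back to DefaultTagger('NN') for unseen words:

import nltk
from nltk.corpus import brown
from nltk.tag import DefaultTagger, UnigramTagger

nltk.download('brown')
nltk.download('punkt')

# Train a unigram tagger on a small slice of Brown, backing off to 'NN'
train_sents = brown.tagged_sents(categories='news')[:1000]
tagger = UnigramTagger(train_sents, backoff=DefaultTagger('NN'))

print(tagger.tag(nltk.word_tokenize("This is an example sentence")))
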

4) Chunking - Design a Python program to group similar words together based on their part of speech.
import nltk
from nltk import word_tokenize, pos_tag

nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

def pos_tagging(text):
    tokens = word_tokenize(text)
    tagged_words = pos_tag(tokens)
    return tagged_words

def group_similar_words(tagged_words):
    # Collect words under their POS tag
    grouped_words = {}
    for word, tag in tagged_words:
        if tag not in grouped_words:
            grouped_words[tag] = []
        grouped_words[tag].append(word)
    return grouped_words

text = "The cat is chasing the mouse. A dog is barking loudly."
tagged_words = pos_tagging(text)
grouped_words = group_similar_words(tagged_words)

for tag, words in grouped_words.items():
    print(f"POS Tag: {tag}, Words: {words} ")
Output: POS Tag: DT, Words: ['The', 'the', 'A']
POS Tag: NN, Words: ['cat', 'mouse', 'dog']
POS Tag: VBZ, Words: ['is', 'is']
POS Tag: VBG, Words: ['chasing', 'barking']
POS Tag: ., Words: ['.', '.']
POS Tag: RB, Words: ['loudly']
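
The grouping above clusters words by tag; classical chunking instead extracts phrases with a grammar. A minimal sketch (hypothetical noun-phrase grammar, assuming 'punkt' and 'averaged_perceptron_tagger' are downloaded) using NLTK's RegexpParser:

import nltk

# NP chunk: optional determiner, any number of adjectives, then a noun
grammar = "NP: {<DT>?<JJ>*<NN>}"
chunk_parser = nltk.RegexpParser(grammar)

tagged = nltk.pos_tag(nltk.word_tokenize("The cat is chasing the mouse."))
tree = chunk_parser.parse(tagged)
print(tree)  # NP chunks such as (NP The/DT cat/NN) appear as subtrees
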

5) Chinking - Design a Python program to remove a sequence of tokens from a chunk.


import nltk

nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

def chink_text(text, chink_pattern):
    tokens = nltk.word_tokenize(text)
    # POS-tag the tokens of the chink pattern so they can be matched against the text
    chinked_tokens = [(word, tag) for (word, tag) in nltk.pos_tag(nltk.word_tokenize(chink_pattern))]
    tagged_tokens = nltk.pos_tag(tokens)
    # Keep only the tagged tokens that do not appear in the chink pattern
    cleaned_tokens = [token for token in tagged_tokens if token not in chinked_tokens]
    cleaned_text = " ".join([word for word, _ in cleaned_tokens])
    return cleaned_text

if __name__ == "__main__":
    text = "The quick brown fox jumps over the lazy dog"
    chink_pattern = "quick brown fox"

    cleaned_text = chink_text(text, chink_pattern)
    print("Cleaned Text: ")
    print(cleaned_text)
Output: Cleaned Text:
The jumps over the lazy dog
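
In NLTK terminology, chinking usually means removing a tag sequence from inside an existing chunk with the }...{ rule syntax; a minimal sketch of that approach (hypothetical grammar, assuming 'punkt' and 'averaged_perceptron_tagger' are downloaded):

import nltk

grammar = r"""
  NP:
    {<.*>+}           # chunk every token
    }<VB.*|IN|DT>{    # then chink verbs, prepositions and determiners back out
"""
parser = nltk.RegexpParser(grammar)

tagged = nltk.pos_tag(nltk.word_tokenize("The quick brown fox jumps over the lazy dog"))
print(parser.parse(tagged))
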

Module 3
1) N-grams - Implement a Python program to generate N-grams.
def generate_ngrams(text, n):
    words = text.split()
    ngrams = []
    for i in range(len(words) - n + 1):
        ngrams.append(words[i:i+n])
    return ngrams

text = "This is a sample text for generating n-grams"
n = 3
result = generate_ngrams(text, n)
print(result)
Output: [['This', 'is', 'a'], ['is', 'a', 'sample'], ['a', 'sample', 'text'], ['sample', 'text', 'for'],
['text', 'for', 'generating'], ['for', 'generating', 'n-grams']]
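
NLTK also ships an ngrams helper that performs the same sliding-window split, returning tuples instead of lists; a minimal sketch on the same sentence:

from nltk import ngrams

text = "This is a sample text for generating n-grams"

# Each trigram is a tuple of three consecutive tokens
trigrams = list(ngrams(text.split(), 3))
print(trigrams)
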

2) Smoothing - Design a Python program to perform smoothing of word counts; Laplace (add-one) smoothing is shown below, and an add-k variant is sketched after the output.
def laplace_smoothing(word_counts, vocab_size):
    smoothed_counts = {}
    total_words = sum(word_counts.values())

    for word, count in word_counts.items():
        smoothed_counts[word] = (count + 1) / (total_words + vocab_size)
    return smoothed_counts

word_counts = {'apple': 3, 'banana': 2, 'orange': 1}
vocab_size = 100
smoothed_count = laplace_smoothing(word_counts, vocab_size)
print('Original count', word_counts)
print("Smooth count", smoothed_count)
Output: Original count {'apple': 3, 'banana': 2, 'orange': 1}
Smooth count {'apple': 0.03773584905660377, 'banana': 0.02830188679245283,
'orange': 0.018867924528301886}
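
Laplace smoothing is the k = 1 case of add-k smoothing; a minimal sketch of the general form, reusing the word counts above (k = 0.5 is an arbitrary illustrative value):

def add_k_smoothing(word_counts, vocab_size, k=0.5):
    total_words = sum(word_counts.values())
    # Add k to every count and renormalize over the enlarged vocabulary
    return {word: (count + k) / (total_words + k * vocab_size)
            for word, count in word_counts.items()}

word_counts = {'apple': 3, 'banana': 2, 'orange': 1}
print(add_k_smoothing(word_counts, vocab_size=100, k=0.5))
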

3) Good-Turing - Develop a Python program to calculate Good-Turing frequency estimates.

from collections import Counter

def good_turing(frequencies):
    freq_of_freq = Counter(frequencies)

    good_turing_frequencies = {}
    for freq, freq_count in freq_of_freq.items():
        if freq + 1 in freq_of_freq:
            good_turing_frequencies[freq] = (freq + 1) * (freq_of_freq[freq + 1] / freq_count)
        else:
            good_turing_frequencies[freq] = freq_count / len(frequencies)
    return good_turing_frequencies

text = "The quick brown fox jumps over the lazy dog"
word_lengths = [len(word) for word in text.split()]

word_length_counts = Counter(word_lengths)
good_turing_frequencies = good_turing(word_lengths)

print("Word Length\tFrequency\tGood-turing Frequency")
for length, freq in word_length_counts.items():
    gt_freq = good_turing_frequencies[length]
    print(f"{length}\t\t{freq}\t\t{gt_freq}")
Output:
Word Length Frequency Good-turing Frequency
3 4 2.0
5 3 0.3333333333333333
4 2 7.5

Module 4
1) Lexical Semantics - Design a Python program to perform text classification.
# Import necessary libraries
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
# Load the 20 newsgroups dataset
newsgroups = fetch_20newsgroups(subset='all')
categories = newsgroups.target_names

# Split the dataset into training and test sets
X_train, X_test, y_train, y_test = train_test_split(newsgroups.data, newsgroups.target,
                                                    test_size=0.3, random_state=42)

# Initialize the TF-IDF Vectorizer and transform the data
vectorizer = TfidfVectorizer(stop_words='english', max_df=0.7)
X_train_tfidf = vectorizer.fit_transform(X_train)
X_test_tfidf = vectorizer.transform(X_test)

# Train the Logistic Regression classifier
classifier = LogisticRegression(max_iter=1000, random_state=42)
classifier.fit(X_train_tfidf, y_train)

# Predict the labels for the test set and print accuracy
y_pred = classifier.predict(X_test_tfidf)
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy * 100:.2f}%")

# Function to classify new text data
def classify_text(text):
    text_tfidf = vectorizer.transform([text])
    prediction = classifier.predict(text_tfidf)
    return categories[prediction[0]]

# Example usage
new_text = "NASA launches a new space mission."
predicted_category = classify_text(new_text)
print(f"The predicted category for the new text is: {predicted_category}")

Output: Accuracy: 90.13%
The predicted category for the new text is: sci.space

2) Meaning Representation - Implement a Python program to represent the meaning of the given text.
import nltk

def represent_meaning(text):
    """
    This function takes a sentence and returns a simple dictionary
    representing its meaning based on part-of-speech tags.
    """
    # Download nltk resources if not already installed
    nltk.download('punkt')
    nltk.download('averaged_perceptron_tagger')

    # Tokenize the sentence
    tokens = nltk.word_tokenize(text)

    # Get part-of-speech tags for each token
    tags = nltk.pos_tag(tokens)

    # Create a dictionary to represent meaning
    meaning = {}
    for token, tag in tags:
        if tag.startswith('VB'):    # Verbs
            meaning['action'] = token
        elif tag.startswith('NN'):  # Nouns
            if 'subject' not in meaning:
                meaning['subject'] = token
            else:
                meaning['object'] = token
        elif tag.startswith('JJ'):  # Adjectives
            meaning['adjective'] = token

    return meaning

# Example usage
sentence = "The cat chased the mouse."
meaning = represent_meaning(sentence)
print(sentence)
print(meaning)

Output: The cat chased the mouse.
{'subject': 'cat', 'action': 'chased', 'object': 'mouse'}

3) Disambiguation - Design the Lesk algorithm in Python to handle word sense disambiguation.
# Import necessary libraries
import nltk
from nltk.corpus import wordnet as wn
from nltk.tokenize import word_tokenize

nltk.download('wordnet')
nltk.download('omw-1.4')
nltk.download('punkt')

# Function to compute overlap between context and gloss
def compute_overlap(context, gloss):
    context = set(context)
    gloss = set(gloss)
    return len(context & gloss)

# Lesk algorithm for word sense disambiguation
def lesk_algorithm(word, sentence):
    # Tokenize the sentence and get the context
    context = word_tokenize(sentence)

    # Get all synsets (senses) of the word
    synsets = wn.synsets(word)
    if not synsets:
        return None

    # Initialize variables to keep track of the best sense and max overlap
    best_sense = synsets[0]
    max_overlap = 0

    for sense in synsets:
        # Get the gloss of the current sense and tokenize it
        gloss = word_tokenize(sense.definition())

        # Compute the overlap between context and gloss
        overlap = compute_overlap(context, gloss)

        # Update best_sense if the current sense has more overlap
        if overlap > max_overlap:
            max_overlap = overlap
            best_sense = sense
    return best_sense

# Example usage
sentence = "I went to the bank to deposit my money."
word = "money"
sense = lesk_algorithm(word, sentence)
if sense:
    print(f"The best sense for '{word}' in the sentence is: {sense.name()}")
    print(f"Definition: {sense.definition()}")
else:
    print(f"No senses found for the word '{word}'.")
Output: The best sense for 'money' in the sentence is: money.n.03
Definition: the official currency issued by a government or national bank
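
NLTK also provides a built-in Lesk implementation in nltk.wsd; a minimal sketch (assuming 'wordnet' and 'punkt' are downloaded as above) disambiguating 'bank' in the same sentence:

from nltk.wsd import lesk
from nltk.tokenize import word_tokenize

sentence = "I went to the bank to deposit my money."
sense = lesk(word_tokenize(sentence), 'bank')
if sense:
    print(sense.name(), '-', sense.definition())
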
Module 5

1) Information Extraction - Design a Python program to extract structured information from unstructured text.

import spacy
import email
from email.policy import default

# Check if the spaCy model is installed and download it if necessary
try:
    nlp = spacy.load("en_core_web_sm")
except OSError:
    import subprocess
    import sys
    subprocess.check_call([sys.executable, '-m', 'spacy', 'download', 'en_core_web_sm'])
    nlp = spacy.load("en_core_web_sm")

def named_entity_recognition(text):
    """
    Perform Named Entity Recognition (NER) on the given text using spaCy.

    Args:
        text (str): The text to process.

    Returns:
        list: A list of tuples containing entities and their labels.
    """
    # Process the text
    doc = nlp(text)

    # Extract named entities
    entities = [(ent.text, ent.label_) for ent in doc.ents]
    return entities

# Sample email content
email_content = """From: [email protected]
To: [email protected]
Subject: Sample Email

This is a sample email for testing information extraction.
"""

# Parse the email content
msg = email.message_from_string(email_content, policy=default)

# Extract and print email fields
from_address = msg['From']
to_address = msg['To']
subject = msg['Subject']
body = msg.get_body(preferencelist=('plain')).get_content()

print("Named Entity Recognition with spaCy:")
entities = named_entity_recognition(body)
for entity, label in entities:
    print(f"Entity: {entity}, Label: {label}")

print("\nExtracting Information from Emails:")
print(f"From: {from_address}")
print(f"To: {to_address}")
print(f"Subject: {subject}")
print(f"Body:\n{body}")
Output: Named Entity Recognition with spaCy:

Extracting Information from Emails:


From: [email protected]
To: [email protected]
Subject: Sample Email
Body:
This is a sample email for testing information extraction.
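
The sample email body contains no named entities, which is why the NER section of the output is empty; a minimal sketch (hypothetical sentence) that produces entities:

# Assumes nlp = spacy.load("en_core_web_sm") as in the program above
sample = "Tim Cook announced that Apple will open a new office in London in 2025."
for ent in nlp(sample).ents:
    print(f"Entity: {ent.text}, Label: {ent.label_}")
# Expected labels (model-dependent): PERSON, ORG, GPE, DATE
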

2) Filtering Stop Words - Implement a Python program to filter stopwords.


import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# Download NLTK stopwords and tokenizer data
nltk.download('stopwords')
nltk.download('punkt')

# Sample text
text = "NLTK is a leading platform for building Python programs to work with human language data."

# Tokenize the text
tokens = word_tokenize(text)

# Get English stopwords from NLTK
stop_words = set(stopwords.words('english'))

# Filter out stopwords
filtered_tokens = [word for word in tokens if word.lower() not in stop_words]

# Print filtered tokens
print("Original Text:")
print(text)
print("\nFiltered Text (without stopwords):")
print(" ".join(filtered_tokens))
Output:
Original Text:
NLTK is a leading platform for building Python programs to work with human language
data.
Filtered Text (without stopwords):
NLTK leading platform building Python programs work human language data .

3) Stemming- Design a Python program to reduce an inflected word down to its word
stem.

import nltk
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

nltk.download('punkt')

def stem_text(text):
    # Initialize the Porter Stemmer
    stemmer = PorterStemmer()

    # Tokenize the text into words
    words = word_tokenize(text)

    # Stem each word in the text
    stemmed_words = [stemmer.stem(word) for word in words]

    # Join stemmed words back into a sentence
    stemmed_text = ' '.join(stemmed_words)

    return stemmed_text

# Example usage:
text = "Stemming is used to reduce words down to their word stem."
stemmed_text = stem_text(text)

print("Original Text:")
print(text)
print("\nText after stemming:")
print(stemmed_text)
Output:
Original Text:
Stemming is used to reduce words down to their word stem.

Text after stemming:
stem is use to reduc word down to their word stem .
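
The Porter stemmer is only one choice; a minimal sketch comparing it with NLTK's Snowball and Lancaster stemmers on a few of the same words (the exact stems may differ between stemmers):

from nltk.stem import PorterStemmer, SnowballStemmer, LancasterStemmer

words = ["stemming", "reduce", "words", "stem"]
for stemmer in (PorterStemmer(), SnowballStemmer("english"), LancasterStemmer()):
    print(type(stemmer).__name__, [stemmer.stem(w) for w in words])
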

4) Question Answering System - Design a question answering system using Python.

import nltk
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

nltk.download('punkt')

# Example corpus of texts
corpus = [
    "Albert Einstein was a German-born theoretical physicist who developed the theory of relativity.",
    "The Mona Lisa is a half-length portrait painting by the Italian artist Leonardo da Vinci.",
    "Python is an interpreted, high-level, general-purpose programming language.",
    "Mount Everest is the highest mountain in the world, located in Nepal."
]

# Preprocess texts
def preprocess_text(text):
    tokens = nltk.word_tokenize(text.lower())
    return ' '.join(tokens)

processed_corpus = [preprocess_text(text) for text in corpus]

# Vectorize texts using TF-IDF
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(processed_corpus)

# Function to answer questions: return the corpus sentence most similar to the question
def answer_question(question):
    question_vec = vectorizer.transform([preprocess_text(question)])
    similarities = cosine_similarity(question_vec, X)
    idx = similarities.argmax()
    return corpus[idx]

# Example questions
questions = [
    "Who developed the theory of relativity?",
    "What is the Mona Lisa?",
    "What is Python used for?",
    "Where is Mount Everest located?"
]

# Answering each question
for question in questions:
    print("Question:", question)
    print("Answer:", answer_question(question))
    print()

Output:
Question: Who developed the theory of relativity?
Answer: Albert Einstein was a German-born theoretical physicist who developed the theory of relativity.

Question: What is the Mona Lisa?
Answer: The Mona Lisa is a half-length portrait painting by the Italian artist Leonardo da Vinci.

Question: What is Python used for?
Answer: Python is an interpreted, high-level, general-purpose programming language.

Question: Where is Mount Everest located?
Answer: Mount Everest is the highest mountain in the world, located in Nepal.
