NLP LAB
PROGRAMS
PROGRAM 1
AIM: To study Preprocessing of text (Tokenization, Filtration, Script Validation, Stop Word
Removal, and Stemming)
i. Word and sentence tokenization:
PROGRAM:
from nltk import word_tokenize, sent_tokenize
sent = "GeeksforGeeks is a great learning platform.\
It is one of the best for Computer Science students."
print(word_tokenize(sent))
print(sent_tokenize(sent))
OUTPUT:
ii. Stemming:
PROGRAM:
from nltk.stem import PorterStemmer
# create an object of class PorterStemmer
porter = PorterStemmer()
print(porter.stem("play"))
print(porter.stem("playing"))
print(porter.stem("plays"))
print(porter.stem("played"))
OUTPUT:
iii. Lemmatization:
PROGRAM:
from nltk.stem import WordNetLemmatizer
# create an object of class WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("plays", 'v'))
print(lemmatizer.lemmatize("played", 'v'))
print(lemmatizer.lemmatize("play", 'v'))
print(lemmatizer.lemmatize("playing", 'v'))
OUTPUT:
iv. Part-of-speech (POS) tagging:
PROGRAM:
from nltk import pos_tag
from nltk import word_tokenize
text = "GeeksforGeeks is a Computer Science platform."
tokenized_text = word_tokenize(text)
tags = pos_tag(tokenized_text)
print(tags)
OUTPUT:
v. Stop word removal:
PROGRAM:
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import sent_tokenize, word_tokenize
print(stopwords.words('english'))
EXAMPLE_TEXT = ("Hello Mr. Smith, how are you doing today? The weather is great, and "
                "Python is awesome. The sky is pinkish-blue. You shouldn't eat cardboard.")
print(sent_tokenize(EXAMPLE_TEXT))
# Remove stop words from the word-tokenized text
stop_words = set(stopwords.words('english'))
filtered_words = [w for w in word_tokenize(EXAMPLE_TEXT) if w.lower() not in stop_words]
print(filtered_words)
OUTPUT:
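The AIM also lists filtration and script validation, which are not demonstrated above. A minimal sketch, assuming filtration means keeping only alphabetic tokens and script validation means checking that every character of a token belongs to the basic Latin script:
from nltk import word_tokenize
text = "GeeksforGeeks is a great learning platform."
tokens = word_tokenize(text)
# Filtration: keep only alphabetic tokens (drops punctuation and numbers)
filtered = [t for t in tokens if t.isalpha()]
print(filtered)
# Script validation: flag tokens containing characters outside basic Latin
def is_latin(token):
    return all(ch.isascii() and ch.isalpha() for ch in token)
print([t for t in filtered if not is_latin(t)])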
RESULT: The above program was executed successfully in Python and the output was verified.
PROGRAM 2
AIM: To write a program in python using NLTK library to study Word Generation.
LIST:
PROGRAM:
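Note: the interactive examples below assume the NLTK book texts have already been loaded, which defines the built-in variables text1-text9 and sent1-sent9:
>>> from nltk.book import *
>>>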
i. General list:
>>> sent2
['The', 'family', 'of', 'Dashwood', 'had', 'long',
'been', 'settled', 'in', 'Sussex', '.']
>>> sent3
['In', 'the', 'beginning', 'God', 'created', 'the',
'heaven', 'and', 'the', 'earth', '.']
ii. Concatenation of two lists:
>>> ['Monty', 'Python'] + ['and', 'the', 'Holy', 'Grail']
['Monty', 'Python', 'and', 'the', 'Holy', 'Grail']
>>>
>>> sent4 + sent1
['Fellow', '-', 'Citizens', 'of', 'the', 'Senate', 'and', 'of', 'the',
'House', 'of', 'Representatives', ':', 'Call', 'me', 'Ishmael', '.']
>>>
iii. Appending to a list:
>>> sent1.append("Some")
>>> sent1
['Call', 'me', 'Ishmael', '.', 'Some']
>>>
INDEXING LISTS:
PROGRAM:
i. Finding the index of an item in a list:
>>> text4.index('awaken')
173
>>>
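The index can also be used the other way round to look the word up again (as in the NLTK book):
>>> text4[173]
'awaken'
>>>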
ii. Slicing
>>> text5[16715:16735]
['U86', 'thats', 'why', 'something', 'like', 'gamefly', 'is', 'so',
'good',
'because', 'you', 'can', 'actually', 'play', 'a', 'full', 'game',
'without',
'buying', 'it']
>>> text6[1600:1625]
['We', "'", 're', 'an', 'anarcho', '-', 'syndicalist', 'commune', '.',
'We',
'take', 'it', 'in', 'turns', 'to', 'act', 'as', 'a', 'sort', 'of',
'executive',
'officer', 'for', 'the', 'week']
>>>
VARIABLES:
PROGRAM:
i. Declaration:
>>> sent1 = ['Call', 'me', 'Ishmael', '.']
>>>
ii. Variables and assignment:
>>> my_sent = ['Bravely', 'bold', 'Sir', 'Robin', ',', 'rode',
... 'forth', 'from', 'Camelot', '.']
>>> noun_phrase = my_sent[1:4]
>>> noun_phrase
['bold', 'Sir', 'Robin']
>>> wOrDs = sorted(noun_phrase)
>>> wOrDs
['Robin', 'Sir', 'bold']
>>>
STRINGS:
PROGRAM:
i. Declaration of string:
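No code appears in the record for this item; a minimal sketch in the same interactive style:
>>> name = 'Monty'
>>> name[0]
'M'
>>> name[:4]
'Mont'
>>>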
ii. Multiplication and addition with strings:
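Likewise, a short illustrative sketch for this item:
>>> name = 'Monty'
>>> name * 2
'MontyMonty'
>>> name + '!'
'Monty!'
>>>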
iii. Split a string into a list:
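And a sketch for splitting (and joining) a string:
>>> ' '.join(['Monty', 'Python'])
'Monty Python'
>>> 'Monty Python'.split()
['Monty', 'Python']
>>>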
STEMMING WORDS WITH NLTK
PROGRAM:
# import these modules
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize
ps = PorterStemmer()
# choose some words to be stemmed
words = ["program", "programs", "programmer", "programming", "programmers"]
for w in words:
    print(w, " : ", ps.stem(w))
OUTPUT:
RESULT: The above program was executed successfully in Python and the output was verified.
PROGRAM 3
AIM: To write a program in python using NLTK library to study Sentiment Analysis.
SENTIMENT ANALYZER:
PROGRAM:
Step 1: Import libraries and load dataset
# import libraries
import pandas as pd
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
# download nltk data (first time only)
nltk.download('all')
# Load the Amazon review dataset
df = pd.read_csv('https://raw.githubusercontent.com/pycaret/pycaret/master/datasets/amazon.csv')
df
Step 2: Preprocessing text
# create preprocess_text function
def preprocess_text(text):
    # Tokenize the text
    tokens = word_tokenize(text.lower())
    # Remove stop words
    filtered_tokens = [token for token in tokens if token not in stopwords.words('english')]
    # Lemmatize the tokens
    lemmatizer = WordNetLemmatizer()
    lemmatized_tokens = [lemmatizer.lemmatize(token) for token in filtered_tokens]
    # Join the tokens back into a string
    processed_text = ' '.join(lemmatized_tokens)
    return processed_text
# apply the function to the review text column
df['reviewText'] = df['reviewText'].apply(preprocess_text)
df
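As a quick sanity check (illustrative sentence, not from the dataset), the function lowercases the text, removes stop words, and lemmatizes the remaining tokens; with the default noun lemmatizer the expected result is roughly:
print(preprocess_text("The cats are running in the gardens"))
# expected: 'cat running garden'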
Step 3: NLTK Sentiment Analyzer
# initialize NLTK sentiment analyzer
analyzer = SentimentIntensityAnalyzer()
# create get_sentiment function
def get_sentiment(text):
    scores = analyzer.polarity_scores(text)
    # label as positive (1) if any positive score is present, otherwise negative (0)
    sentiment = 1 if scores['pos'] > 0 else 0
    return sentiment
# apply get_sentiment function
df['sentiment'] = df['reviewText'].apply(get_sentiment)
df
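Note that get_sentiment labels a review as positive whenever the 'pos' score is non-zero. A common alternative (not part of the original program) is to threshold VADER's compound score, 0.05 being the cutoff suggested in the VADER documentation:
# alternative labelling rule using the compound score
def get_sentiment_compound(text):
    scores = analyzer.polarity_scores(text)
    return 1 if scores['compound'] >= 0.05 else 0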
Step 4: confusion matrix.
from sklearn.metrics import confusion_matrix
print(confusion_matrix(df['Positive'], df['sentiment']))
OUTPUT:
Step 5: classification report
from sklearn.metrics import classification_report
print(classification_report(df['Positive'], df['sentiment']))
OUTPUT:
RESULT: The above program was executed successfully in Python and the output was verified.
PROGRAM 4
AIM: To write a program in python using NLTK library to study N-gram model.
PROGRAM:
# imports
import string
import random
import nltk
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('reuters')
from nltk.corpus import reuters, stopwords
from nltk.util import ngrams
from nltk import FreqDist
from collections import defaultdict, Counter
# input the reuters sentences
sents = reuters.sents()
# build the removal list: stop words and punctuation
stop_words = set(stopwords.words('english'))
string.punctuation = string.punctuation + '“' + '”' + '-' + '‘' + '’' + '—'
string.punctuation
removal_list = list(stop_words) + list(string.punctuation) + ['lt', 'rt']
removal_list
# generate unigrams, bigrams and trigrams
unigram = []
bigram = []
trigram = []
tokenized_text = []
for sentence in sents:
    # lowercase the sentence and drop sentence-final periods
    sentence = [w.lower() for w in sentence if w != '.']
    unigram.extend(sentence)
    tokenized_text.append(sentence)
    bigram.extend(list(ngrams(sentence, 2, pad_left=True, pad_right=True)))
    trigram.extend(list(ngrams(sentence, 3, pad_left=True, pad_right=True)))
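# Illustrative check (not part of the original record): what padded bigrams
# look like for a toy sentence, with the default pad symbol None.
# list(ngrams(['the', 'price', 'rose'], 2, pad_left=True, pad_right=True))
# -> [(None, 'the'), ('the', 'price'), ('price', 'rose'), ('rose', None)]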
# remove the n-grams with removable words
def remove_stopwords(x):
    y = []
    for pair in x:
        count = 0
        for word in pair:
            if word in removal_list:
                count = count or 0
            else:
                count = count or 1
        if count == 1:
            # keep the n-gram if at least one element is not a stop word / punctuation
            y.append(pair)
    return y
unigram = remove_stopwords(unigram)
bigram = remove_stopwords(bigram)
trigram = remove_stopwords(trigram)
# generate frequency distributions of the n-grams
freq_bi = FreqDist(bigram)
freq_tri = FreqDist(trigram)
# count trigram continuations: d[(a, b)][c] = frequency of the trigram (a, b, c)
d = defaultdict(Counter)
for a, b, c in freq_tri:
    if a is not None and b is not None and c is not None:
        d[a, b][c] += freq_tri[a, b, c]
# Next word prediction
s = ''
def pick_word(counter):
    "Chooses a random element."
    return random.choice(list(counter.elements()))
prefix = "he", "said"
print(" ".join(prefix))
s = " ".join(prefix)
for i in range(19):
    suffix = pick_word(d[prefix])
    s = s + ' ' + suffix
    print(s)
    prefix = prefix[1], suffix
OUTPUT:
RESULT: The above program was executed successfully in Python and the output was verified.
PROGRAM 5
AIM: To write a program in python using NLTK library to study N-Grams Smoothing.
PROGRAM:
Step 1:
import nltk
from nltk.corpus import brown
from collections import defaultdict, Counter
wds = brown.words()
N = len(wds)
print(N)
OUTPUT:
1161192
Step 2:
mle_unigram_dist = nltk.FreqDist([w.lower() for w in wds])
bigram_seq = list(nltk.bigrams(wds))
bigram_N = len(bigram_seq)
print(bigram_N)
OUTPUT:
1161191
Step 3:
wds[:10]
OUTPUT:
['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', 'Friday', 'an', 'investigation', 'of']
Step 4:
bigram_seq[:10]
Output:
[('The', 'Fulton'), ('Fulton', 'County'), ('County', 'Grand'), ('Grand', 'Jury'), ('Jury', 'said'), ('said',
'Friday'), ('Friday', 'an'), ('an', 'investigation'), ('investigation', 'of'), ('of', "Atlanta's")]
Step 5: frequency distribution:
# MLE stands for Maximum Likelihood Estimate
mle_bigram_dist = nltk.FreqDist((x.lower(),y.lower()) for (x,y) in bigram_seq)
print(mle_unigram_dist)
print(mle_unigram_dist['the'])
print(mle_bigram_dist)
print(mle_bigram_dist['the','only'])
OUTPUT:
<FreqDist with 49815 samples and 1161192 outcomes>
69971
<FreqDist with 436003 samples and 1161191 outcomes>
258
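From the counts above, the unsmoothed (MLE) bigram probability can be read off directly; for example:
# P_mle(only | the) = c('the', 'only') / c('the') = 258 / 69971
print(mle_bigram_dist[('the', 'only')] / mle_unigram_dist['the'])  # approx. 0.00369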
Step 6:
print(mle_bigram_dist['the','time'])
print(mle_bigram_dist['the','boy'])
print(mle_bigram_dist['the','red'])
OUTPUT:
251
81
44
Step 7:
print(49815**2)
print(f'{49815**2:,}')
OUTPUT:
2481534225
2,481,534,225
Step 8:
print(436003/(49815**2))
print(f'{436003/(49815**2):.3%}')
OUTPUT:
0.00017569896703721667
0.018%
Step 9: Normalization
# V is the vocabulary size (number of distinct lower-cased word types)
V = len(mle_unigram_dist)
norm_factor = float(N)/(N + V**2)
bigram_norm_factor = float(bigram_N)/(bigram_N + V**2)
new_cts = defaultdict(lambda: norm_factor)
for w in mle_unigram_dist:
    new_cts[w] = mle_unigram_dist[w] * norm_factor
new_bigram_cts = defaultdict(lambda: bigram_norm_factor)
for big in mle_bigram_dist:
    new_bigram_cts[big] = mle_bigram_dist[big] * bigram_norm_factor
new_laplace_cts = [float(new_bigram_cts[('in','the')]),
                   float(new_bigram_cts[('said','the')]),
                   float(new_bigram_cts[('sewer','brother')]),
                   ]
old_cts = [float(mle_bigram_dist[('in','the')]),
           float(mle_bigram_dist[('said','the')]),
           float(mle_bigram_dist[('sewer','brother')]),
           ]
print()
print('{0:<16s} {1:^9s} {2:^9s}'.format('', 'Raw cts', 'Smthd Cts'))
for (i, p) in enumerate(['in_the', 'said_the', 'sewer_brother']):
    print('{0:<16s} {1:<6} {2:<6.7f}'.format(p, old_cts[i], new_laplace_cts[i]))
print()
print('{0:<16s} {1:^9s} {2:^9s}'.format('', 'Raw cts', 'Smthd Cts'))
for (i, p) in enumerate(['in']):
    print('{0:<16s} {1:<6} {2:<6.7f}'.format(p, mle_unigram_dist[p], new_cts[p]))
print()
print('P(the | in)', end=' ')
print(new_bigram_cts['in','the']/new_cts['in'])
OUTPUT:
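For comparison, the textbook add-one (Laplace) conditional probability can be computed directly from the raw counts; a minimal sketch, where V is the unigram vocabulary size used in Step 9:
V = len(mle_unigram_dist)  # vocabulary size (same V as in Step 9)
def laplace_bigram_prob(w1, w2):
    # P_laplace(w2 | w1) = (c(w1, w2) + 1) / (c(w1) + V)
    return (mle_bigram_dist[(w1, w2)] + 1) / (mle_unigram_dist[w1] + V)
print('P_laplace(the | in):', laplace_bigram_prob('in', 'the'))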
RESULT: The above program was executed successfully in Python and the output was verified.