NLP Practicals
(TY AI A & B)
PRACTICAL 1
PROBLEM STATEMENT:
OBJECTIVE:
Introduction:
Natural Language Processing (NLP) is the process of manipulating or understanding text or speech with software or a machine. The analogy is that humans interact with each other, understand each other's views, and respond with an appropriate answer; in NLP, this interaction, understanding, and response are carried out by a computer instead of a human.
Named entity recognition (NER) is one of the most popular data preprocessing tasks. It involves identifying the key information in a text and classifying it into a set of predefined categories. An entity is essentially the thing that is consistently talked about or referred to in the text.
NER is a form of NLP.
At its core, NER is a two-step process. The two steps involved are:
Detecting the entities from the text
Classifying them into different categories
Some of the most important entity categories in NER are:
Person
Organization
Place/ location
Other common tasks include classifying the following:
Date/time expressions
Numeral measurements (money, percent, weight, etc.)
E-mail addresses
Ambiguity in NE
For a person, the category definition is intuitively quite clear, but for computers, there is some ambiguity in classification. Let's look at some ambiguous examples:
England (Organisation) won the 2019 World Cup vs. the 2019 World Cup happened in England (Location).
Washington (Location) is the capital of the US vs. the first president of the US was Washington (Person).
Methods of NER
One way is to train a model for multi-class classification using different machine learning algorithms, but this requires a lot of labelling. In addition to labelling, the model also requires a deep understanding of context to deal with the ambiguity of sentences. This makes it a challenging task for a simple machine learning algorithm.
Another way is the Conditional Random Field (CRF), which is implemented by both the NLP Speech Tagger and NLTK. It is a probabilistic model that can be used to model sequential data such as words, and it can capture a deep understanding of the context of a sentence.
Deep learning based NER: deep learning NER is much more accurate than the previous methods, as it is capable of assembling words. This is because it uses a method called word embedding, which can capture the semantic and syntactic relationships between words. It is also able to learn topic-specific as well as high-level representations automatically. This makes deep learning NER applicable to multiple tasks.
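As a small, self-contained illustration of word embeddings (the toy corpus and parameters below are chosen only for demonstration and are not part of the original practical), Gensim's Word2Vec can learn such vectors:

# Toy word-embedding sketch with Gensim's Word2Vec (illustrative corpus and settings)
from gensim.models import Word2Vec

toy_corpus = [
    ["england", "won", "the", "world", "cup"],
    ["the", "world", "cup", "happened", "in", "england"],
    ["washington", "is", "the", "capital", "of", "the", "us"],
]
model = Word2Vec(toy_corpus, vector_size=16, window=2, min_count=1, epochs=50)

# Each word now has a dense vector; words used in similar contexts get similar vectors
print(model.wv["england"][:5])
print(model.wv.most_similar("england", topn=2))

With a corpus this small the similarities are not meaningful; the point is only to show that every word becomes a numeric vector that downstream NER models can consume.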
Sentence: Sundar Pichai, the CEO of Google Inc. is walking in the streets of California.
From the above sentence, we can identify three named entities: Sundar Pichai (Person), Google Inc. (Organization), and California (Location).
But to do the same thing with computers, we first need to help them recognize entities so that they can categorize them. To do so, we can take the help of machine learning and Natural Language Processing (NLP).
The roles of these two in implementing NER with computers are:
NLP: It studies the structure and rules of language and forms intelligent systems that
are capable of deriving meaning from text and speech.
Machine Learning: It helps machines learn and improve over time.
To learn what an entity is, a NER model needs to be able to detect a word or string of
words that form an entity (e.g. California) and decide which entity category it belongs
to.
So, as a concluding step we can say that the heart of any NER model is a two-step
process:
So first, we need to create entity categories, like Name, Location, Event, Organization,
etc., and feed a NER model relevant training data. Then, by tagging some samples of
words and phrases with their corresponding entities, we’ll eventually teach our NER
model to detect the entities and categorize them.
Named entity recognition (NER) helps us easily identify the key components in a text, such as names of people, places, brands, monetary values, and more. Extracting the main entities from a text helps us sort unstructured data and detect important information, which is crucial when dealing with large datasets.
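As a quick, hands-on illustration of this two-step process on the example sentence above, the pretrained chunker that ships with NLTK can be used. This is a minimal sketch (the downloads are the standard NLTK resources the chunker needs; the exact labels it assigns depend on its pretrained model):

import nltk

# Standard NLTK resources required by the tagger and the named-entity chunker
nltk.download("punkt")
nltk.download("averaged_perceptron_tagger")
nltk.download("maxent_ne_chunker")
nltk.download("words")

sentence = "Sundar Pichai, the CEO of Google Inc. is walking in the streets of California."

tokens = nltk.word_tokenize(sentence)   # split the sentence into words
tagged = nltk.pos_tag(tokens)           # add part-of-speech tags
tree = nltk.ne_chunk(tagged)            # detect and classify entities

# Print only the named-entity subtrees, e.g. PERSON Sundar Pichai
for subtree in tree:
    if hasattr(subtree, "label"):
        print(subtree.label(), " ".join(word for word, tag in subtree.leaves()))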
Some popular NLP libraries are:
spaCy: A completely optimized and highly accurate library, widely used in deep learning.
Stanford CoreNLP: A good library for client-server-based architecture, also accessible through NLTK. It is written in Java but provides modules so that it can be used from Python.
TextBlob: An NLP library that works in Python 2 and Python 3. It is used for processing textual data and provides almost all common operations in the form of an API.
Gensim: A robust open-source NLP library for Python. This library is highly efficient and scalable.
Tokenizing
By tokenizing, you can conveniently split up text by word or by sentence. This will
allow you to work with smaller pieces of text that are still relatively coherent and
meaningful even outside of the context of the rest of the text. It’s your first step in
turning unstructured data into structured data, which is easier to analyze.
When you’re analyzing text, you’ll be tokenizing by word and tokenizing by sentence.
Here’s what both types of tokenization bring to the table:
Tokenizing by word: Words are like the atoms of natural language. They’re the
smallest unit of meaning that still makes sense on its own. Tokenizing your text
by word allows you to identify words that come up particularly often. For
example, if you were analyzing a group of job ads, then you might find that the
word “Python” comes up often. That could suggest high demand for Python
knowledge, but you’d need to look deeper to know more.
Tokenizing by sentence: When you tokenize by sentence, you can analyze how
those words relate to one another and see more context. Are there a lot of
negative words around the word “Python” because the hiring manager doesn’t
like Python? Are there more terms from the domain of herpetology than the
domain of software development, suggesting that you may be dealing with an
entirely different kind of python than you were expecting?
Here’s how to import the relevant parts of NLTK so you can tokenize by word and by
sentence:
>>> from nltk.tokenize import sent_tokenize, word_tokenize
Now that you’ve imported what you need, you can create a string to tokenize. Here’s
a quote from Dune that you can use:
>>> example_string = """
... Muad'Dib learned rapidly because his first training was in how to learn.
... And the first lesson of all was the basic trust that he could learn.
... It's shocking to find how many people do not believe they can learn,
... and how many more believe learning to be difficult."""
You can use sent_tokenize() to split up example_string into sentences:
>>> sent_tokenize(example_string)
["Muad'Dib learned rapidly because his first training was in how to learn.",
'And the first lesson of all was the basic trust that he could learn.',
"It's shocking to find how many people do not believe they can learn, and how many
more believe learning to be difficult."]
Tokenizing example_string by sentence gives you a list of three strings that are
sentences:
1. "Muad'Dib learned rapidly because his first training was in how to learn."
2. 'And the first lesson of all was the basic trust that he could learn.'
3. "It's shocking to find how many people do not believe they can learn, and how
many more believe learning to be difficult."
>>> word_tokenize(example_string)
["Muad'Dib", 'learned', 'rapidly', 'because', 'his', 'first', 'training', 'was', 'in', 'how', 'to',
'learn', '.', 'And', 'the', 'first', 'lesson', 'of', 'all', 'was', 'the', 'basic', 'trust', 'that',
'he', 'could', 'learn', '.', 'It', "'s", 'shocking', 'to', 'find', 'how', 'many', 'people', 'do',
'not', 'believe', 'they', 'can', 'learn', ',', 'and', 'how', 'many', 'more', 'believe', 'learning',
'to', 'be', 'difficult', '.']
You got a list of strings that NLTK considers to be words, such as:
"Muad'Dib"
'training'
'how'
"'s"
','
'.'
See how "It's" was split at the apostrophe to give you 'It' and "'s", but "Muad'Dib" was
left whole? This happened because NLTK knows that 'It' and "'s" (a contraction of
“is”) are two distinct words, so it counted them separately. But "Muad'Dib" isn’t an
accepted contraction like "It's", so it wasn’t read as two separate words and was left
intact.
Here’s how to import the relevant parts of NLTK in order to filter out stop words:
>>> import nltk
>>> nltk.download("stopwords")
>>> from nltk.corpus import stopwords
>>> from nltk.tokenize import word_tokenize
Here’s a quote from Worf that you can filter:
>>> worf_quote = "Sir, I protest. I am not a merry man!"
Now tokenize worf_quote by word and store the resulting list in words_in_quote:
>>> words_in_quote = word_tokenize(worf_quote)
>>> words_in_quote
['Sir', ',', 'I', 'protest', '.', 'I', 'am', 'not', 'a', 'merry', 'man', '!']
You have a list of the words in worf_quote, so the next step is to create a set of stop words to
filter words_in_quote. For this example, you’ll need to focus on stop words in "english":
>>> stop_words = set(stopwords.words("english"))
Next, create an empty list to hold the words that make it past the filter:
>>> filtered_list = []
You created an empty list, filtered_list, to hold all the words in words_in_quote that aren’t stop
words. Now you can use stop_words to filter words_in_quote:
>>> for word in words_in_quote:
...     if word.casefold() not in stop_words:
...         filtered_list.append(word)
You iterated over words_in_quote with a for loop and added all the words that weren’t stop words
to filtered_list. You used .casefold() on word so you could ignore whether the letters in word were
uppercase or lowercase. This is worth doing because stopwords.words('english') includes only
lowercase versions of stop words.
Alternatively, you could use a list comprehension to make a list of all the words in your text that
aren’t stop words:
>>> filtered_list = [
... word for word in words_in_quote if word.casefold() not in stop_words
... ]
When you use a list comprehension, you don’t create an empty list and then add items to the end
of it. Instead, you define the list and its contents at the same time. Using a list comprehension is
often seen as more Pythonic.
>>> filtered_list
['Sir', ',', 'protest', '.', 'merry', 'man', '!']
You filtered out a few words like 'am' and 'a', but you also filtered out 'not', which does affect the
overall meaning of the sentence. (Worf won’t be happy about this.)
Words like 'I' and 'not' may seem too important to filter out, and depending on what kind of
analysis you want to do, they can be. Here’s why:
'I' is a pronoun; pronouns are context words rather than content words:
o Content words give you information about the topics covered in the text or the
sentiment that the author has about those topics.
o Context words give you information about writing style. You can observe
patterns in how authors use context words in order to quantify their writing style.
Once you’ve quantified their writing style, you can analyze a text written by an
unknown author to see how closely it follows a particular writing style so you can
try to identify who the author is.
'not' is technically an adverb but has still been included in NLTK’s list of stop words for
English. If you want to edit the list of stop words to exclude 'not' or make other changes,
then you can download it.
So, 'I' and 'not' can be important parts of a sentence, but it depends on what you’re trying to learn
from that sentence.
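For instance, if negation matters for your analysis, one option (a small, self-contained sketch) is to remove 'not' from the stop-word set before filtering:

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

worf_quote = "Sir, I protest. I am not a merry man!"
words_in_quote = word_tokenize(worf_quote)

# Keep negation words such as 'not', since they can flip the meaning of a sentence
custom_stop_words = set(stopwords.words("english")) - {"not", "no"}
filtered_with_negation = [
    word for word in words_in_quote if word.casefold() not in custom_stop_words
]
print(filtered_with_negation)
# ['Sir', ',', 'protest', '.', 'not', 'merry', 'man', '!']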
Conclusion:
In this practical, we learned how to perform named entity recognition and its supporting preprocessing steps, such as tokenization and stop-word removal, using NLTK.
Natural Language Processing
ASSIGNMENT NO.2 :
TITLE:
PROBLEM STATEMENT:
OBJECTIVE:
THEORY:
Emotion detection: Sentiments such as happy, sad, angry, upset, jolly, and pleasant come under emotion detection. It is commonly performed with lexicon-based methods of sentiment analysis.
Aspect based sentiment analysis: It focuses on a particular aspect; for instance, if a person wants to check a feature of a cell phone, aspects such as the battery, screen, or camera quality are examined, and aspect-based analysis is used.
Multilingual sentiment analysis: This involves text in different languages, where the classification still needs to be done as positive, negative, or neutral. It is highly challenging and comparatively difficult.
Automatic Approach: This approach works on machine learning techniques. First, the dataset is used to train a model and predictive analysis is done. The next process is the extraction of words (features) from the text, and classification is performed using techniques such as Naive Bayes, Linear Regression, Support Vector Machines, and Deep Learning.
Hybrid Approach: It is the combination of both the above approaches, i.e. the rule-based and automatic approaches. Its advantage is that the accuracy is higher than with either of the other two approaches.
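As a tiny illustration of the automatic (machine learning) approach described above, the sketch below trains a Naive Bayes classifier on a handful of made-up labelled sentences (the training data and labels are purely illustrative):

# Minimal machine-learning sentiment classifier using scikit-learn
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

train_texts = [
    "I love this phone, the camera is amazing",
    "Great battery life and a bright screen",
    "Terrible service, I am very disappointed",
    "The product broke after one day, awful",
]
train_labels = ["positive", "positive", "negative", "negative"]

clf = make_pipeline(CountVectorizer(), MultinomialNB())
clf.fit(train_texts, train_labels)

print(clf.predict(["The screen is great but the battery is awful"]))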
Applications
Code :
1. Textblob
TextBlob is a Python (2 and 3) library for processing textual data. It provides a simple API for diving into
common natural language processing (NLP) tasks such as part-of-speech tagging, noun phrase extraction,
sentiment analysis, classification, translation, and more.
https://textblob.readthedocs.io/en/dev/
import nltk
nltk.download('movie_reviews')
nltk.download('punkt')
from textblob import TextBlob

text = 'Apple is a very succcessful company due to its visionary leadership team. They have a very bright future'

# Default method (PatternAnalyzer)
blob = TextBlob(text)
blob.sentiment
Sentiment(polarity=0.24625000000000002, subjectivity=0.45)

# Example with a different (negative) input text, not shown here
blob = TextBlob(text)
blob.sentiment
Sentiment(polarity=-0.125, subjectivity=0.375)
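The movie_reviews corpus downloaded above is what TextBlob's alternative NaiveBayesAnalyzer is trained on. Switching analyzers is a one-line change; a small sketch (reusing the text variable defined above; the output format differs from the default PatternAnalyzer):

from textblob import TextBlob
from textblob.sentiments import NaiveBayesAnalyzer

# NaiveBayesAnalyzer is trained on the NLTK movie_reviews corpus downloaded earlier
blob_nb = TextBlob(text, analyzer=NaiveBayesAnalyzer())
print(blob_nb.sentiment)  # e.g. Sentiment(classification='pos', p_pos=..., p_neg=...)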
2. Word-Dictionary based
Create a positive/negative word dictionary from the University of Pittsburgh MPQA corpus (link below):
http://mpqa.cs.pitt.edu/lexicons/subj_lexicon/
The MPQA Opinion Corpus contains news articles from a wide variety of news sources manually
annotated for opinions and other private states (i.e., beliefs, emotions, sentiments, speculations, etc.)
# Build positive/negative word lists from the MPQA subjectivity lexicon
# (the file names below are assumed; adjust them to your download location)
pos_words = []
neg_words = []
for line in open('subjclueslen1-HLTEMNLP05.tff'):
    line_attrib = dict(a.split('=') for a in line.split() if '=' in a)
    word = line_attrib.get('word1')
    polarity = line_attrib.get('priorpolarity')
    if polarity == 'positive':
        pos_words.append(word)
    elif polarity == 'negative':
        neg_words.append(word)
with open('pos_words.txt', 'w') as myfile:
    myfile.write('\n'.join(pos_words))
with open('neg_words.txt', 'w') as myfile:
    myfile.write('\n'.join(neg_words))
Total negative words found: 4911
# Add a few domain-specific words to the dictionaries
pos_word_add = ['lifted']
for term in pos_word_add:
    pos_words.append(term)
neg_word_add = ['coronavirus','covid','corona','covid19','pandemic','lockdown']
for term in neg_word_add:
    neg_words.append(term)
import nltk

def calc_sentiment_based_on_word_dict(text):
    sentiment_score = 0
    words = nltk.word_tokenize(text)
    for word in words:
        if word in pos_words:
            print('pos:', word)
            sentiment_score = sentiment_score + 1
        if word in neg_words:
            print('neg:', word)
            sentiment_score = sentiment_score - 1
    return sentiment_score / len(words)
Example 1
text = 'Apple is a very succcessful company due to its visionary leadership team. They have a very bright
future'
sentiment = calc_sentiment_based_on_word_dict(text)
pos: visionary
pos: bright
Example 2
# Note: text is redefined here to a different, COVID-related string (not shown above)
sentiment = calc_sentiment_based_on_word_dict(text)
neg: lockdown
neg: covid19
neg: pandemic
The sentiment score of this text is: -0.30
import spacy
nlp = spacy.load("en_core_web_sm")

def spacy_ner(text):
    doc = nlp(text)
    entities = []
    labels = []
    position_start = []
    position_end = []
    for ent in doc.ents:
        if ent.label_ in ['PERSON', 'ORG', 'GPE']:
            entities.append(ent)
            labels.append(ent.label_)
            position_start.append(ent.start_char)
            position_end.append(ent.end_char)
    return entities, labels
def fit_ner(df):
    ner = df['text'].apply(spacy_ner)
    ner_org = {}
    ner_per = {}
    ner_gpe = {}
    for x in ner:
        # x is the (entities, labels) tuple returned by spacy_ner
        for entity, label in zip(x[0], x[1]):
            if label == 'ORG':
                ner_org[entity.text] = ner_org.get(entity.text, 0) + 1
            elif label == 'PERSON':
                ner_per[entity.text] = ner_per.get(entity.text, 0) + 1
            else:
                ner_gpe[entity.text] = ner_gpe.get(entity.text, 0) + 1
    return {'ORG': ner_org, 'PER': ner_per, 'GPE': ner_gpe}
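The calls below assume a DataFrame named news_df with one news article per row and the raw text in a 'text' column; a minimal, purely illustrative sketch of its expected shape:

import pandas as pd

# news_df: one news article per row, raw article text in a 'text' column (placeholder rows)
news_df = pd.DataFrame({
    "text": [
        "Example news article about a company and its CEO ...",
        "Another article mentioning a city and a hospital ...",
    ]
})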
named_entities = fit_ner(news_df)
named_entities['ORG']
Information Report': 2, 'Magistrate': 3, 'NDTV': 2, 'St. Bernard’s Hospital': 1, 'Trinity Hospital': 1, 'the
University of Chicago Medical Center': 1, 'Mercy Hospital': 1, 'Deboni': 6, 'Nolan': 2, 'DUI': 2, 'THC': 1,
'Federman': 1, 'Chicago Police Supt': 1, 'Madurai': 1, 'High Court': 1, 'Human Rights': 1, "People's Watch": 1,
'the National Oceanic & Atmospheric Administration': 2, 'Sonoma Tech Meteorologist': 2, 'NOAA': 1, 'Comer
Children’s Hospital': 1, 'Deutsch': 1, 'Shades of Darkness Window Tinting': 1, 'Chrysler': 1, 'Palos Community
Hospital': 1, 'the Phoolbagan Police Station': 1, 'DCP': 3, 'Muralidhar Agarwal': 1, 'the Kolkata Police': 1, 'the
Mahadevpura Police': 1, 'Amit Agarwal': 2, 'Sooraj': 1, 'Kollam Rural Police': 1, 'Duke University': 2,
'Fahrenheit': 2, 'the National Climate Assessment': 1, 'Nasdaq': 2, 'Dow': 2, 'MUFG': 1, 'United Airlines': 1,
'Delta': 1, 'Bleakley Advisory Group': 1, 'Carnival': 1, 'Royal Caribbean': 1, 'Disney': 1, 'The Santa Fe Police
Department': 1, 'SFR': 1, 'FBI': 2}
named_entities['PER']
{'Target': 1, 'Brian Cornell': 1, 'Shawn': 1, 'Max Melia': 1, 'Max': 2, 'Rahman Daiyan': 1, 'Dr Emma Lovell': 1,
'Rose Amal': 1, 'Wolfgang Palme': 1, '-11º Celsius': 1, 'Kamryn Johnson': 1, 'Kamryn': 1, 'Ron Johnson': 1,
'George Floyd': 3, 'Justin Amash': 1, 'Tim Scott': 1, 'Scott': 1, 'Madisen Hallberg': 1, 'Emmanuel Henreid': 1,
'Hallberg': 1, 'Peter Horby': 1, 'Martin Landry': 1, 'Horby': 1, 'Cher': 1, 'Kaavan': 1, 'April Judd': 1, 'Jody
McDowell': 1, 'Tamil Nadu': 2, 'Pennis': 4, 'Santhankulam': 2, 'Arun Balagopalan': 2, 'Jessie D. Gregory': 1,
'Fernwood': 1, 'Gresham': 1, 'Stephen Nolan': 1, 'Kevin Deboni': 1, 'Nolan': 6, 'Caleb White': 1, 'Irving
Federman': 1, 'David Brown': 1, 'Brown': 1, 'Stroger Hospital': 1, 'Stroger': 2, 'J. Jones': 1, 'E Palaniswami': 1,
'Kanimozhi': 1, 'Henri Tiphagne': 1, 'Saharan': 1, 'Jeff Beamish': 2, 'Wayne Deutsch': 1, 'Deutsch': 3, 'Jasean
Francis': 1, 'Charles Riley': 1, 'Puma': 1, 'Kolkata': 3, 'Lalita Dhandhania': 1, 'Shilpi': 1, 'Mrs Shilpi Agarwal': 1,
'Mahadevpura PS': 1, 'Uthra': 2, 'Hari Sankar': 1, 'Complications': 1, 'Chris Rupkey': 1, 'Peter Boockvar': 1,
'Retailer Gap': 1, 'Paul Joye': 1, 'Joye': 1, 'Bajit Singh': 1}
named_entities['GPE']
Targeted sentiment: find the sentences containing the named entities and perform sentiment analysis only on those sentences, one by one.
import nltk

def get_keyword_sentences(text, keyword):
    sentences = nltk.sent_tokenize(text)
    sent_of_interest = []
    for sent in sentences:
        if keyword in sent.lower():
            sent_of_interest.append(sent)
    return sent_of_interest

def get_sentiment(list_of_sent):
    sentiment = 0
    count = 0
    if list_of_sent != False:
        for sent in list_of_sent:
            blob = TextBlob(sent)
            sentiment += blob.sentiment.polarity
            count += len(sent)
    # average polarity per character of matched text (assumed normalisation,
    # consistent with the small score values shown below)
    return sentiment / count if count else 0
def extract_sentiment(df, keywords, top=10):
    """df: data for the cluster, keywords: ner for that cluster"""
    sentiment_dict = {}
    # iterate over the 'top' most frequent named entities (assumed selection step)
    for keyword in sorted(keywords, key=keywords.get, reverse=True)[:top]:
        df['sentences'] = df['text'].apply(get_keyword_sentences, keyword=keyword.lower())
        keyword_sentiment = df['sentences'].apply(get_sentiment).sum()
        sentiment_dict[keyword] = keyword_sentiment
    return sentiment_dict
keywords = named_entities['ORG']
sentiment_result_org = extract_sentiment(news_df,keywords,top=30)
sentiment_result_org
[('GNN', 0.004569909563803812),
('Target', 0.0012131536833410769),
('the Great American Outdoors Act', 0.0011459492888064317),
('Senate', 0.0007037413380270522),
('FDA', 0.0005633802816901409),
('DUI', 0.0003423736143555005),
('Nolan', 0.00020554291972032683),
('Jayaraj', 0.0),
('NDTV', 0.0),
('Deboni', -0.0003628628628628629),
('Magistrate', -0.0006684491978609625),
('DCP', -0.0010416666666666667),
('Fahrenheit', -0.0010437154242313098)]
keywords = named_entities['PER']
sentiment_result_org = extract_sentiment(news_df,keywords,top=30)
sentiment_result_org
book = open('book_cab_and_goose.txt', encoding="utf8").read()
blob = TextBlob(book)
blob.sentiment
# Per-sentence sentiment for the whole book
blob = TextBlob(book)
polarity = []
subjectivity = []
sentences = []
for sentence in blob.sentences:
    polarity.append(sentence.sentiment.polarity)
    subjectivity.append(sentence.sentiment.subjectivity)
    sentences.append(str(sentence.raw))

import pandas as pd
PatternAnalyzerDF = pd.DataFrame()  # holds one row per sentence
PatternAnalyzerDF['sentence'] = sentences
PatternAnalyzerDF['polarity'] = polarity
PatternAnalyzerDF['subjectivity'] = subjectivity
PatternAnalyzerDF.head(10)
sentence | polarity | subjectivity
0 | The Project Gutenberg eBook, Cab and Caboose, by Kirk Munroe This eBook is for the use of anyone anywhere at no cost and with almost no restrictions whatsoever. You may copy it, give it away or re-use it under the terms of the Project Gutenberg License included with this eBook or online at www.gutenberg.org Title: Cab and Caboose The Story of a Railroad Boy Author: Kirk Munroe Release Date: September 4, 2007 [eBook #22497] Language: English | 0.000000 | 0.000000
1 | ***START OF THE PROJECT GUTENBERG EBOOK CAB AND CABOOSE*** E-text prepared by Mark C. Orton, Linda McKeown, Anne Storer, and the Project Gutenberg Online Distributed Proofreading Team (http://www.pgdp.net) Note: Project Gutenberg also has an HTML version of this file which includes the original illustrations. | 0.187500 | 0.375000
import matplotlib.pyplot as plt

PatternAnalyzerDF.sort_index(inplace=True)
sentiment_top_df = PatternAnalyzerDF.head(n=200)
pd.set_option('display.max_colwidth', 200)
plt.figure().set_size_inches(20, 10)
plt.plot(sentiment_top_df['polarity'])
plt.xlabel('Sentence')
plt.ylabel('Polarity')
plt.show()
Conclusion
In this assignment, we performed sentiment analysis in several ways: with TextBlob's default analyzer, with a positive/negative word dictionary built from the MPQA lexicon, and with entity-targeted sentiment analysis on news articles and a full book.
EXPERIMENT NUMBER 03
TITLE: -
OBJECTIVE: -
THEORY: -
Text summarization is the practice of breaking down long publications into manageable paragraphs or sentences. The procedure extracts the important information while ensuring that the meaning of the text is preserved. This shortens the time it takes to comprehend long materials such as research articles without omitting critical information.
The process of constructing a concise, cohesive, and fluent summary of a lengthier text document, which involves outlining its major points, is known as text summarization.
Because of the growing availability of documents, much research in the field of natural language processing (NLP) is devoted to automatic text summarization. Automatic text summarization is the task of creating a succinct and fluent summary without human assistance while preserving the meaning of the original text.
1. Extraction-based summarization
The extractive text summarization approach entails extracting the essential words and sentences from the source material and combining them to form the summary, without changing the original text.
2. Abstractive summarization
Another way of text summarization is abstractive summarization. Here we create new sentences rather than copying them from the original text, so the sentences of the summary may not appear in the source document (a brief sketch of this approach is given after the list of summarization types below).
On the basis of Context
3. Domain-Specific
In domain-specific summarization, domain knowledge is applied. Specific context, knowledge, and language
can be merged using domain-specific summarizers. For example, models can be combined with the
terminology used in medical science so that they can better grasp and summarize scientific texts.
4. Query-based
Query-based summaries are primarily concerned with natural language questions: the summary answers a question asked about the input document, similar to the result snippets that search engines return for a query.
5. Generic
Generic summarizers, unlike domain-specific or query-based summarizers, are not programmed to make any
assumptions. The content from the source document is simply condensed or summarized.
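The code in this experiment implements the extractive approach; for comparison, the abstractive approach mentioned above can be sketched with a pretrained sequence-to-sequence model (the Hugging Face model name, input text, and length limits below are illustrative assumptions, not part of the original experiment):

# Abstractive summarization sketch with a pretrained transformer model
from transformers import pipeline

summarizer = pipeline("summarization", model="sshleifer/distilbart-cnn-12-6")

long_text = (
    "Python is a high-level, general-purpose programming language. "
    "Its design philosophy emphasizes code readability with the use of significant indentation. "
    "Guido van Rossum began working on Python in the late 1980s and first released it in 1991."
)

# The model writes new sentences rather than copying them from the input
print(summarizer(long_text, max_length=40, min_length=10, do_sample=False)[0]["summary_text"])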
#Code –
import spacy
nlp = spacy.load("en_core_web_sm")

# Sample text used in the experiment (Wikipedia-style introduction to Python)
doc1 = """Python is a high-level, general-purpose programming language. Its design philosophy emphasizes code readability with the use of significant indentation.
Python is dynamically-typed and garbage-collected. It supports multiple programming paradigms, including structured (particularly procedural), object-oriented and functional programming. It is often described as a "batteries included" language due to its comprehensive standard library.
Guido van Rossum began working on Python in the late 1980s as a successor to the ABC programming language and first released it in 1991 as Python 0.9.0.[36] Python 2.0 was released in 2000 and introduced new features such as list comprehensions, cycle-detecting garbage collection, reference counting, and Unicode support.
Python 3.0, released in 2008, was a major revision that is not completely backward-compatible with earlier versions. Python 2 was discontinued with version 2.7.18 in 2020."""

docx = nlp(doc1)

# Tokenization of Text
mytokens = [token.text for token in docx]
# For calculating the frequencies of each word that is not a stopword
from spacy.lang.en.stop_words import STOP_WORDS
stopwords = list(STOP_WORDS)

word_freq = {}
for word in docx:
    if word.text not in stopwords:
        if word.text not in word_freq.keys():
            word_freq[word.text] = 1
        else:
            word_freq[word.text] += 1
'paradigms': 1, 'including': 1, 'structured': 1, '(': 1, 'particularly': 1, 'procedural': 1, ')': 1,
'object': 1, 'oriented': 1, 'functional': 1, 'described': 1, '"': 2, 'batteries': 1, 'included': 1,
'comprehensive': 1, 'standard': 1, 'library': 1, 'Guido': 1, 'van': 1, 'Rossum': 1, 'began': 1,
'working': 1, 'late': 1, '1980s': 1, 'successor': 1, 'ABC': 1, 'released': 3, '1991': 1,
'0.9.0.[36': 1, ']': 1, '2.0': 1, '2000': 1, 'introduced': 1, 'new': 1, 'features': 1, 'list': 1,
'comprehensions': 1, 'cycle': 1, 'detecting': 1, 'collection': 1, 'reference': 1, 'counting': 1,
'Unicode': 1, 'support': 1, '3.0': 1, '2008': 1, 'major': 1, 'revision': 1, 'completely': 1,
'backward': 1, 'compatible': 1, 'earlier': 1, 'versions': 1, '2': 1, 'discontinued': 1,
'version': 1, '2.7.18': 1, '2020': 1}
max_freq = max(word_freq.values())
for word in word_freq.keys():
    word_freq[word] = word_freq[word] / max_freq  # normalise counts by the maximum frequency
print(word_freq)
{'Python': 0.875, 'high': 0.125, '-': 0.875, 'level': 0.125, ',': 1.0, 'general': 0.125, 'purpose': 0.125,
'programming': 0.5, 'language': 0.375, '.': 1.0, 'Its': 0.125, 'design': 0.125, 'philosophy': 0.125, 'emphasizes':
0.125, 'code': 0.125, 'readability': 0.125, 'use': 0.125, 'significant': 0.125, 'indentation': 0.125, '\n': 0.375,
'dynamically': 0.125, 'typed': 0.125, 'garbage': 0.25, 'collected': 0.125, 'It': 0.25, 'supports': 0.125,
'multiple': 0.125, 'paradigms': 0.125, 'including': 0.125, 'structured': 0.125, '(': 0.125, 'particularly': 0.125,
'procedural': 0.125, ')': 0.125, 'object': 0.125, 'oriented': 0.125, 'functional': 0.125, 'described': 0.125, '"':
0.25, 'batteries': 0.125, 'included': 0.125, 'comprehensive': 0.125, 'standard': 0.125, 'library': 0.125, 'Guido':
0.125, 'van': 0.125, 'Rossum': 0.125, 'began': 0.125, 'working': 0.125, 'late': 0.125, '1980s': 0.125,
'successor': 0.125, 'ABC': 0.125, 'released': 0.375, '1991': 0.125, '0.9.0.[36': 0.125, ']': 0.125, '2.0': 0.125,
'2000': 0.125, 'introduced': 0.125, 'new': 0.125, 'features': 0.125, 'list': 0.125, 'comprehensions': 0.125,
'cycle': 0.125, 'detecting': 0.125, 'collection': 0.125, 'reference': 0.125, 'counting': 0.125, 'Unicode': 0.125,
'support': 0.125, '3.0': 0.125, '2008': 0.125, 'major': 0.125, 'revision': 0.125, 'completely': 0.125,
'backward': 0.125, 'compatible': 0.125, 'earlier': 0.125, 'versions': 0.125, '2': 0.125, 'discontinued': 0.125,
'version': 0.125, '2.7.18': 0.125, '2020': 0.125}
# Sentence Tokens-->
sentence_list = [ sentence for sentence in docx.sents]
['python', 'is', 'a', 'high', '-', 'level', ',', 'general', '-', 'purpose', 'programming', 'language', '.', 'its', 'design',
'philosophy', 'emphasizes', 'code', 'readability', 'with', 'the', 'use', 'of', 'significant', 'indentation', '.', '\n',
'python', 'is', 'dynamically', '-', 'typed', 'and', 'garbage', '-', 'collected', '.', 'it', 'supports', 'multiple',
'programming', 'paradigms', ',', 'including', 'structured', '(', 'particularly', 'procedural', ')', ',', 'object', '-',
'oriented', 'and', 'functional', 'programming', '.', 'it', 'is', 'often', 'described', 'as', 'a', '"', 'batteries', 'included',
'"', 'language', 'due', 'to', 'its', 'comprehensive', 'standard', 'library', '.', '\n', 'guido', 'van', 'rossum', 'began',
'working', 'on', 'python', 'in', 'the', 'late', '1980s', 'as', 'a', 'successor', 'to', 'the', 'abc', 'programming',
'language', 'and', 'first', 'released', 'it', 'in', '1991', 'as', 'python', '0.9.0.[36', ']', 'python', '2.0', 'was', 'released',
'in', '2000', 'and', 'introduced', 'new', 'features', 'such', 'as', 'list', 'comprehensions', ',', 'cycle', '-', 'detecting',
'garbage', 'collection', ',', 'reference', 'counting', ',', 'and', 'unicode', 'support', '.', '\n', 'python', '3.0', ',',
'released', 'in', '2008', ',', 'was', 'a', 'major', 'revision', 'that', 'is', 'not', 'completely', 'backward', '-',
'compatible', 'with', 'earlier', 'versions', '.', 'python', '2', 'was', 'discontinued', 'with', 'version', '2.7.18', 'in',
'2020', '.']
sentence_list
[Python is a high-level, general-purpose programming language., Its design philosophy emphasizes code
readability with the use of significant indentation., Python is dynamically-typed and garbage-collected., It
supports multiple programming paradigms, including structured (particularly procedural), object-oriented
and functional programming., It is often described as a "batteries included" language due to its
comprehensive standard library., Guido van Rossum began working on Python in the late 1980s as a
successor to the ABC programming language and first released it in 1991 as Python 0.9.0.[36] Python 2.0
was released in 2000 and introduced new features such as list comprehensions, cycle-detecting garbage
collection, reference counting, and Unicode support. , Python 3.0, released in 2008, was a major revision
that is not completely backward-compatible with earlier versions., Python 2 was discontinued with
version 2.7.18 in 2020.]
sentence_scores = {}
for sentence in sentence_list:
    for word in sentence:
        if word.text in word_freq.keys():
            if len(sentence.text.split(' ')) < 30:
                if sentence not in sentence_scores.keys():
                    sentence_scores[sentence] = word_freq[word.text]
                else:
                    sentence_scores[sentence] += word_freq[word.text]
{Python is a high-level, general-purpose programming language.: 6.0, Its design philosophy emphasizes
code readability with the use of significant indentation.: 2.5, Python is dynamically-typed and garbage-
collected.: 4.25, It supports multiple programming paradigms, including structured (particularly
procedural), object-oriented and functional programming.: 6.625, It is often described as a "batteries
included" language due to its comprehensive standard library.: 3.25, Python 3.0, released in 2008, was a
major revision that is not completely backward-compatible with earlier versions.: 6.25, Python 2 was
discontinued with version 2.7.18 in 2020.: 2.5}
from heapq import nlargest  # assumed: select the top-scoring sentences for the summary
sum_sentences = nlargest(7, sentence_scores, key=sentence_scores.get)
sum_sentences
[It supports multiple programming paradigms, including structured (particularly procedural), object-
oriented and functional programming., Python 3.0, released in 2008, was a major revision that is not
completely backward-compatible with earlier versions., Python is a high-level, general-purpose
programming language., Python is dynamically-typed and garbage-collected., It is often described as a
"batteries included" language due to its comprehensive standard library., Its design philosophy
emphasizes code readability with the use of significant indentation., Python 2 was discontinued with
version 2.7.18 in 2020.]
for w in sum_sentences:
    print(w.text)
Python is dynamically-typed and garbage-collected.
It is often described as a "batteries included" language due to its comprehensive standard library.
Its design philosophy emphasizes code readability with the use of significant indentation.
summary = ' '.join([w.text for w in sum_sentences])  # assumed: join the selected sentences
len(summary)
613
len(doc1)
935
#Conclusion: -
In this experiment, we learned how to implement an extractive text-summarization technique in natural language processing using word-frequency-based sentence scoring with spaCy.
Experiment No 4
TITLE: -
OBJECTIVE: -
THEORY: -
Using corpus-based methods, more complex translations can be performed, with better handling of differences in linguistic typology, better recognition of phrases, and better translation of idioms, as well as the isolation of anomalies. Current systems cannot yet perform at the level of a human translator, but this may become possible in the future.
In simple terms, machine translation works by using computer software to translate text from one source language into another target language. There are different types of machine translation, and in the next section we will discuss them in detail.
Different types of machine translation in NLP
Types of Machine Translation
NMT is a type of machine translation that relies on neural network models (inspired by the human brain) to build statistical models for translation. The essential advantage of NMT is that it provides a single system that can be trained end-to-end on the source and target text. As a result, it does not rely on the specialised pipelines that are common in other machine translation systems, particularly SMT.
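As a hands-on sketch of NMT (the pretrained model, language pair, and example sentence below are illustrative choices, not part of this experiment), a pretrained neural translation model can be called through the Hugging Face pipeline API:

# Neural machine translation sketch using a pretrained sequence-to-sequence model
from transformers import pipeline

translator = pipeline("translation_en_to_fr", model="t5-small")

result = translator("Machine translation converts text from one language to another.")
print(result[0]["translation_text"])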
What are the benefits of machine translation?
One of the crucial benefits of machine translation is speed: computer programs can translate huge amounts of text rapidly. Human translators do their work more accurately, but they cannot match the speed of a computer.
If you specially train the engine for your requirements, machine translation gives an ideal blend of quick and cost-effective translations, as it is less expensive than using a human translator. With a specially trained engine, MT can capture the context of full sentences before translating them, which gives you high-quality and human-sounding output. Another benefit of machine translation is its capability to learn important words and reuse them wherever they might fit.
Applications of machine translation
Machine translation technology and products have been used in numerous application
situations, for example, business travel, the travel industry, etc. In terms of the object
of translation, there are composed language-oriented content text translation and
spoken language.
Text translation
Text translation applications include the translation of query and retrieval inputs and of the results of optical character recognition (OCR) on images. Text-level translation applications include the translation of all kinds of plain documents as well as documents with structured data.
Also, methods that obtain data at sentence-level granularity from the training corpus are more effective than those based on other morphological levels, for example words, phrases, and text passages. Finally, sentence-level translation can naturally be extended to support translation at other morphological levels.
Speech translation
With the fast advancement of mobile applications, voice input has become a convenient method of human-computer interaction, and speech translation has become a significant application scenario. The fundamental pipeline of speech translation is "source-language speech, source-language text, target-language text, target-language speech".
In this pipeline, automatic text translation from source-language text to target-language text is the key intermediate module. In addition, the front end and back end also need automatic speech recognition (ASR) and text-to-speech (TTS).
Other applications
Fundamentally, the task of machine translation is to convert a word sequence in one source language into a semantically equivalent word sequence in the target language. In general, it completes a sequence-transformation task: it converts one sequence object into another sequence object according to some knowledge and logic, through models and algorithms.
In fact, many task scenarios involve transformations between sequence objects, and language in the machine translation task is just one type of sequence object. Therefore, when the concepts of source language and target language are extended from languages to other types of sequence objects, machine translation methods and techniques can be applied to solve many similar transformation tasks.
Machine translation hits a sweet spot of cost and speed, offering a very quick way for brands to translate their documents at scale without much overhead. Yet that does not mean it is always applicable. Human translation, on the other hand, is excellent for tasks that require extra care and nuance: skilled translators work on your brand's content to capture the original meaning and convey that feeling or message in the translated work.
Depending on how much content needs to be translated, machine translation can deliver the translated content almost instantly, whereas human translators will take additional time. The time spent finding, vetting, and managing a team of translators must also be considered.
Machine translation is the instant conversion of text from one language to another using artificial intelligence, whereas human translation involves actual brainpower, in the form of one or more translators translating the text manually.
Conclusion
Machine translators analyse the structure of the original content, then break this content down into single words or short phrases that can be easily translated. Finally, they reconstruct those words and phrases using the very same structure in the chosen target language.
Experiment No 5
TITLE: -
OBJECTIVE: -
THEORY: -
Aspect Modelling in Sentiment Analysis (ABSA):
Aspect modelling is an advanced text-analysis technique that breaks the text input down into aspect categories and their aspect terms, and then identifies the sentiment behind each aspect in the whole text input. The two key terms in this model are:
Sentiments: A positive or negative review about a particular aspect
Aspects: the category, feature, or topic that is under observation.
Requirement
In the business world, there is always a major need to observe the sentiment of people towards a particular product or service, to ensure their continued interest in the business's product. ABSA fulfils this purpose by identifying the sentiment behind each aspect category, such as food, location, and ambience.
This helps businesses keep track of the changing sentiment of the customer in each field
of their business.
ABSA Architecture
The ABSA model includes the following steps in order to obtain the desired output.
Step 1 - Consider the input text corpus and pre-process the dataset.
Step 2 - Create word embeddings of the text input (i.e. vectorize the text input and create tokens).
Step 3 - (a) Identify the aspect categories and aspect terms, and (b) determine the sentiment polarity for each aspect.
Step 4 - Combine 3.a and 3.b to get the Aspect Based Sentiment. (OUTPUT)
Intuition:
Aspect: It is defined as a concept on which an opinion or a sentiment is based. Let us
take an example for better understanding.
Suppose a company builds a web app that works slowly but offers reliable results with high accuracy. Here, we break this text into two aspects: "web app works slow" and "reliable results with high accuracy". On observing the two aspect categories, you can easily conclude that they have different sentiments associated with them. (As shown in Fig 1)
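A quick way to see these two different sentiments is to score each aspect phrase with TextBlob (a small sketch; the exact polarity values depend on TextBlob's lexicon):

from textblob import TextBlob

# The two aspect phrases from the example above
for phrase in ["web app works slow", "reliable results with high accuracy"]:
    print(phrase, "->", TextBlob(phrase).sentiment.polarity)
# typically the first phrase scores negative and the second positive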
Tools & Dataset Used:
spaCy (tokenization, sentence boundary detection, dependency parser,
etc.)
Since Python is my go-to language, all of the tools and libraries I used are available for Python. If you are more comfortable with other languages, there are many other approaches you can take, e.g. using Stanford CoreNLP, which normally runs on Java, instead of spaCy, which runs on Python.
import spacy

sp = spacy.load("en_core_web_sm")

# Example review sentences (the original list was not shown; these are placeholders)
mixed_sen = [
    'The food we had yesterday was delicious',
    'The web app works very slow',
    'The results are extremely reliable',
]

# Extract the aspect (subject noun) and its descriptive term from each sentence
ext_aspects = []
for sen in mixed_sen:
    important = sp(sen)
    descriptive_item = ''
    target = ''
    for token in important:
        if token.dep_ == 'nsubj' and token.pos_ == 'NOUN':
            target = token.text
        if token.pos_ == 'ADJ':
            added_terms = ''
            for mini_token in token.children:
                if mini_token.pos_ != 'ADV':
                    continue
                added_terms += mini_token.text + ' '
            descriptive_item = added_terms + token.text
    ext_aspects.append({'aspect': target,
                        'description': descriptive_item})

print("ASPECT EXTRACTION\n")
print(ext_aspects)
from textblob import TextBlob

for aspect in ext_aspects:
    aspect['sentiment'] = TextBlob(aspect['description']).sentiment

print("\n")
print("SENTIMENT ASSOCIATION\n")
print(ext_aspects)
Output:
Topic Modeling:
Topic modeling is a machine learning technique that automatically analyzes text data to determine cluster words for a set of documents. This is known as 'unsupervised' machine learning because it doesn't require a predefined list of tags or training data that's been previously classified by humans. Since topic modeling doesn't require training, it's a quick and easy way to start analyzing your data. However, you can't guarantee you'll receive accurate results, which is why many businesses opt to invest time training a topic classification model.
Since topic classification models require training, they're known as 'supervised' machine learning techniques. What does that mean? Well, as opposed to topic modeling, topic classification needs to know the topics of a set of texts before analyzing them. Using these topics, data is tagged manually so that a topic classifier can learn and later make predictions by itself.
For example, let's say you're a software company that's released a new data analysis feature, and you want to analyze what customers are saying about it. You'd first need to create a list of tags (topics) that are relevant to the new feature, e.g. Data Analysis, Features, User Experience, and then you'd need to use data samples to teach your topic classifier how to tag each text using these predefined topic tags. Although topic classification involves extra legwork, this topic analysis technique delivers more accurate results than unsupervised techniques, which means you'll get more valuable insights that help you make better, data-based decisions.
Topic Modeling
The underlying assumption is that every document comprises a statistical mixture of topics, i.e. a
statistical distribution of topics that can be obtained by “adding up” all of the distributions for all
the topics covered.
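In a minimal probabilistic form (notation introduced here only for illustration, not taken from the original text), this assumption can be written as:

P(w \mid d) = \sum_{t=1}^{T} P(w \mid t)\, P(t \mid d)

where d is a document, w a word, and T the number of topics: the probability of a word appearing in a document is the sum of its probabilities under each topic, weighted by how strongly the document expresses that topic.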
Now, it’s time to build a model for topic modeling! We’ll be using the preprocessed data from
the previous tutorial. Our weapon of choice this time around is Gensim, a simple library that’s
perfect for getting started with topic modeling.
Now, pick up from halfway through the classifier tutorial where we turned reviews into a list of
stemmed words without any connectors:
With this list of words, we can use Gensim to create a dictionary using the bag of words model:
Next, we can use this dictionary to train an LDA model. We’ll instruct Gensim to find three
topics (clusters) in the data:
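The dictionary-building and model-training code appeared as images in the original; a minimal sketch using Gensim's standard API (the stemmed_reviews placeholder below stands in for the list of stemmed words from the previous step):

import gensim
from gensim import corpora

# stemmed_reviews would come from the preprocessing step; a tiny placeholder:
stemmed_reviews = [
    ["room", "clean", "comfort", "bed"],
    ["staff", "friend", "help", "recept"],
    ["bathroom", "dirti", "smell"],
]

dictionary = corpora.Dictionary(stemmed_reviews)

# bag-of-words representation of each review
corpus = [dictionary.doc2bow(review) for review in stemmed_reviews]

# train an LDA model that looks for three topics (clusters)
model = gensim.models.LdaModel(corpus, num_topics=3, id2word=dictionary, passes=15)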
topics = model.print_topics(num_words=3)
for topic in topics:
    print(topic)
And that’s it! We have trained our first model for topic modeling! The code will return some
words that represent the three topics:
For each topic cluster, we can see how the LDA algorithm surfaces words that look a lot like
keywords for our original topics (Facilities, Comfort, and Cleanliness).
Since this is an example with just a few training samples we can’t really understand the data, but
we’ve illustrated the basics of how to do topic modeling using Gensim.
Conclusion: In this experiment, we studied two NLP techniques: aspect-based sentiment analysis (aspect mining) and topic modeling.