SENTIMENT ANALYSIS ON MOVIE REVIEWS
Natural Language Processing UML602
Project Report
BE Third Year, COE
Submitted by:
101603120 Himanshu Dhiman
101603125 Himanshu Pandey
Submitted to:
Dr. Aashima Sharma
Computer Science and Engineering Department
TIET, Patiala
April, 2019
1. INTRODUCTION
Sentiment Analysis means analyzing the sentiment of a given text or document and categorizing it into a specific class (here, positive or negative). By definition, sentiment analysis refers to the use of natural language processing, text analysis, computational linguistics, and biometrics to systematically identify, extract, quantify, and study affective states and subjective information. Sentiment analysis is also referred to as opinion mining. It is mostly applied to social media and customer review data.
1.1 Steps involved during sentiment analysis
Figure 1.1
1.2 Libraries used
Natural Language Toolkit (NLTK)
NLTK is a leading platform for building Python programs to work with human language data. It
provides easy-to-use interfaces to over 50 corpora and lexical resources such as WordNet, along
with a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing,
and semantic reasoning.
2. STEPS OF WORKING
In this project, NLTK’s movie_reviews corpus is used as our labeled training data. The
movie_reviews corpus contains 2,000 movie reviews with sentiment polarity classification. The
two categories for classification are positive and negative, and the movie_reviews corpus already
has the reviews labeled accordingly. The reviews are classified using a supervised classification
technique, in which the classifier is trained with labeled training data.
The figure below depicts the workflow followed during training and testing of the model.
Figure 2.2
2.1 Pre-processing of data
Three different ways are used to pre-process the data to achieve maximum training and testing
accuracy.
2.1.1 Using 2000 most frequently occurring words:
1. Convert movie review data into useful format
2. Remove Stopwords and Punctuation
3. Create word feature using 2000 most frequently occurring words
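The three steps above can be sketched in plain Python (a hedged illustration: the toy reviews, the tiny stop-word list, and N = 5 stand in for the real movie_reviews corpus, NLTK's stop-word list, and the 2,000 most frequent words):

```python
import string
from collections import Counter

# Toy labeled reviews stand in for NLTK's movie_reviews corpus.
reviews = [
    ("a great and touching film , truly great".split(), "pos"),
    ("a dull , boring and bad film".split(), "neg"),
]

stopwords = {"a", "and", "the", "is"}  # tiny stand-in stop-word list

def clean(words):
    # Step 2: drop stop-words and punctuation tokens.
    return [w for w in words if w not in stopwords and w not in string.punctuation]

# Step 3: build the word-feature list from the N most frequent cleaned words.
N = 5
freq = Counter(w for words, _ in reviews for w in clean(words))
word_features = [w for w, _ in freq.most_common(N)]

def document_features(words):
    # One boolean feature per frequent word: is it present in this review?
    present = set(clean(words))
    return {f"contains({w})": (w in present) for w in word_features}
```

Each review is thus reduced to a fixed-length dictionary of presence/absence features, which is the format NLTK's classifiers expect.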
2.1.2 Bag of words feature
1. Create separate lists of positive and negative reviews
2. Shuffle both lists separately and take an equal number of reviews from each
3. Train the classifier and test the model
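The balancing step can be sketched in plain Python (a hedged illustration; the toy review lists and variable names are assumptions standing in for the corpus categories):

```python
import random

# Toy positive and negative review lists (stand-ins for the corpus categories).
pos_reviews = [["good", "movie"], ["great", "plot"], ["loved", "it"]]
neg_reviews = [["bad", "movie"], ["poor", "plot"], ["hated", "it"]]

def bag_of_words(words):
    # Every word in the review becomes a True-valued feature.
    return {w: True for w in words}

# Shuffle each list separately, then take an equal number from each
# so the train/test sets stay balanced between the two classes.
random.seed(0)
random.shuffle(pos_reviews)
random.shuffle(neg_reviews)
n = min(len(pos_reviews), len(neg_reviews))
feature_sets = ([(bag_of_words(r), "pos") for r in pos_reviews[:n]]
                + [(bag_of_words(r), "neg") for r in neg_reviews[:n]])
```

Shuffling each class separately before combining guarantees both classes are equally represented, which a single shuffled list does not.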
2.1.3 n-gram feature
1. Create separate lists of positive and negative reviews
2. Define two feature-extraction functions:
bag_of_words: extracts only unigram features from the movie review words
bag_of_ngrams: extracts only bigram features from the movie review words
3. Define another function, bag_of_all_words, that combines both unigram and bigram features
4. Train the classifier and test the model
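The three functions can be sketched in plain Python (the function names follow the report, but the bodies here are assumptions written without NLTK):

```python
def bag_of_words(words):
    # Unigram features: each single word maps to True.
    return {w: True for w in words}

def bag_of_ngrams(words, n=2):
    # Bigram features: each pair of adjacent words maps to True.
    return {ngram: True for ngram in zip(*[words[i:] for i in range(n)])}

def bag_of_all_words(words):
    # Combined feature set: unigrams plus bigrams.
    features = bag_of_words(words)
    features.update(bag_of_ngrams(words))
    return features
```

Bigrams let the classifier see pairs such as ("not", "good"), which carry sentiment that the individual unigrams "not" and "good" miss.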
2.2 Training of model
The model is trained using NLTK's built-in Naïve Bayes classifier. It is a simple, fast
probabilistic classifier that performs well on small datasets. It applies Bayes' theorem, which
describes the probability of an event based on prior knowledge of conditions that might be
related to the event, under the "naïve" assumption that the features are independent given the
class.
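The project itself uses NLTK's classifier; the following is only a minimal from-scratch sketch of the underlying idea, scoring each label by log P(label) + Σ log P(word | label) with add-one smoothing (all names and the toy documents are assumptions):

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(labeled_docs):
    # Count label frequencies and per-label word frequencies.
    label_counts = Counter(label for _, label in labeled_docs)
    word_counts = defaultdict(Counter)
    vocab = set()
    for words, label in labeled_docs:
        word_counts[label].update(words)
        vocab.update(words)

    def classify(words):
        # Pick the label maximizing log P(label) + sum of log P(word | label),
        # using add-one (Laplace) smoothing for unseen words.
        best, best_score = None, -math.inf
        for label, count in label_counts.items():
            score = math.log(count / len(labeled_docs))
            total = sum(word_counts[label].values()) + len(vocab)
            for w in words:
                score += math.log((word_counts[label][w] + 1) / total)
            if score > best_score:
                best, best_score = label, score
        return best

    return classify

docs = [(["good", "fun"], "pos"), (["great", "fun"], "pos"),
        (["bad", "dull"], "neg"), (["poor", "dull"], "neg")]
classify = train_naive_bayes(docs)
```

Working in log space avoids floating-point underflow when many word probabilities are multiplied together.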
2.3 Testing of model
The model accuracy is tested on training data as well as on custom data input by the user.
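Accuracy here is simply the fraction of reviews the classifier labels correctly; a minimal sketch (the always-"pos" stand-in classifier and the toy test set are assumptions for illustration):

```python
def accuracy(classify, labeled_docs):
    # Fraction of documents whose predicted label matches the true label.
    correct = sum(1 for words, label in labeled_docs if classify(words) == label)
    return correct / len(labeled_docs)

# A trivial stand-in classifier that always predicts "pos".
always_pos = lambda words: "pos"
test_docs = [(["good", "film"], "pos"), (["bad", "film"], "neg")]
```

NLTK exposes the same measure as nltk.classify.accuracy(classifier, test_set).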
3. CODE
Pre-processing of data
The code shown below builds a frequency distribution of all the words in the document and
removes stop-words and punctuation from the text; the resulting cleaned words are added to a
new list.
Figure 3.1
Creating document feature using top-N occurring words
The code shown below creates the document feature using the 2,000 most frequently occurring
words, then trains the model using the Naïve Bayes classifier and prints the accuracy of the model.
Figure 3.2
Creating feature word using bag of words method
The code shown below separates the reviews into positive and negative lists, which makes it
possible to sample positive and negative data evenly, and then pre-processes the data.
Figure 3.3
Bi-Gram Feature
In bag of words feature extraction, we used only unigrams. In the example below, we use
both unigram and bigram features, i.e. we deal with both single words and word pairs.
Figure 3.4
Training the model
After pre-processing, the created feature sets are trained using NLTK’s Naïve Bayes classifier.
Figure 3.5
Figure 3.6
4. Results
top-N most frequently occurring words –
Figure 4.1
We can see that custom negative reviews are categorized accurately, but for custom positive
reviews we get inaccurate results.
In the top-N feature, we only used the top 2,000 words in the feature set. We combined the
positive and negative reviews into a single list, randomized it, and then separated the train and
test sets. This approach can result in an uneven distribution of positive and negative reviews
across the train and test sets.
Bag of words Feature –
Figure 4.2
Using the bag of words feature we now get correct results on the custom test reviews, but the
overall accuracy of the model drops to 70%.
Bi-gram Feature –
Figure 4.3
The accuracy of the classifier increased significantly when trained with the combined feature set
(unigram + bigram).
Accuracy was 70% using only unigram features and rose to 77% using the combined
(unigram + bigram) features.
5. Applications & Future Scope
5.1 Brand Monitoring - also called reputation management. A good reputation matters greatly
these days, when the majority of us check social media reviews as well as review sites before
making a purchase decision; sentiment analysis can track what that reputation looks like at scale.
5.2 Customer support - Social media are now primary channels of communication with
customers, and whenever they are unhappy about something related to you, whether or not it is
your fault, they will call you out on Facebook, Twitter or Instagram.
Sentiment analysis can flag such mentions in a dashboard so they can be engaged with as soon
as they appear.
People nowadays expect brands to respond on social media almost immediately, and if you are
not quick enough, you may see them moving on to your competitors instead of waiting for your
reply.