Improved Techniques For Online Review Spam Detection
Master of Technology under the Dual Degree Programme
in Computer Science and Engineering
Smriti Singh
(Roll: 710CS1033)
I certify that:
The work enclosed in this thesis has been done by me under the supervision of
my project guide.
The work has not been submitted to any other Institute for any degree or
diploma.
I have conformed to the norms and guidelines given in the Ethical Code of
Conduct of National Institute of Technology, Rourkela.
May 5, 2015
Certificate
This is to certify that the work in the thesis entitled Improved Techniques for
Online Review Spam Detection by Smriti Singh is a record of an original
research work carried out under my supervision and guidance in partial fulfilment
of the requirements for the award of the degree of Master of Technology, under
the Dual Degree Programme, in Computer Science and Engineering. Neither
this thesis nor any part of it has been submitted for any degree or academic award
elsewhere.
Acknowledgement
I owe deep gratitude to those who have contributed greatly to the completion of this thesis.
Foremost, I would like to express my gratitude towards my project advisor, Prof. Sanjay Kumar Jena, whose mentorship has been paramount, not only in carrying out the research for this thesis, but also in developing long-term goals for my career. His guidance has been unique and delightful. I would also like to thank my mentor, Jitendra Rout Sir, who provided his able guidance whenever I needed it. He inspired me to be an independent thinker and to choose my own path and work independently.
I would also like to extend special thanks to my project review panel for their time and attention to detail. Their constructive feedback has been instrumental in improving my work further.
I would like to specially thank my friend Shaswat Rungta for his profound insight
and for guiding me to improve the final product, as well as my other friends for their
support and encouragement.
My parents receive my deepest love for being the strength in me.
Smriti Singh
Abstract
The rapid upsurge in the number of e-commerce websites has made the internet an extensive source of product reviews. Since there is no scrutiny of the quality of the reviews written, anyone can write practically anything, which ultimately leads to Review Spams. There has been an increase in the number of Deceptive Review Spams - fictitious reviews that have been deliberately fabricated to seem genuine. In this work, we have explored both supervised and unsupervised methodologies to identify Review Spams. Improved techniques have been proposed to assemble the most effective feature set for model building. Sentiment Analysis and its results have also been integrated into spam review detection. Several well-known classifiers have been used on the labelled dataset in order to obtain the best performance. We have also used a clustering approach on an unlabelled Amazon reviews dataset. From our results, we identify the most decisive and crucial attributes that lead us to the detection of spam and spammers. We also suggest various practices that could be incorporated by websites in order to detect Review Spams.
Certificate iii
Acknowledgement iv
Abstract v
List of Figures ix
List of Tables x
1 Introduction 1
1.1 What is Review Spam? . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Challenges in Review Spam Detection . . . . . . . . . . . . . . . . . . 2
1.3 Motivation and Objective . . . . . . . . . . . . . . . . . . . . . . . . 2
1.4 Problem Statement . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.5 Thesis Organisation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2 Literature Review 6
2.1 Types of Spams . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.1.1 Email Spam . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.1.2 Comment Spam . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.1.3 Instant Messaging Spam . . . . . . . . . . . . . . . . . . . . . 6
2.1.4 Junk Fax . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.1.5 Unsolicited Text Messages Spam or SMS Spam . . . . . . . . 7
2.1.6 Social Networking Spam . . . . . . . . . . . . . . . . . . . . . 7
2.2 Types of Review Spams . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.3 Types of Spammers . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.4 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.5 Spam Detection Methods . . . . . . . . . . . . . . . . . . . . . . . . . 16
3 Supervised Method 18
3.1 Automated Approaches to Deceptive Review Spam Detection . . . . . 18
3.1.1 Linguistic Characteristics as Features . . . . . . . . . . . . . . 18
3.1.2 Genre Identification: POS Tagging as a Feature . . . . . . . . 18
3.1.3 Text Categorisation: N-gram as a Feature . . . . . . . . . . . 19
3.1.4 Sentiment as a Feature . . . . . . . . . . . . . . . . . . . . . . 19
3.2 Classifiers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
3.2.1 Naive Bayes . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
3.2.2 Support Vector Machine . . . . . . . . . . . . . . . . . . . . . 21
3.2.3 Decision Tree . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
4 Proposed Work 24
4.1 Dataset Collection . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
4.1.1 Dataset Sources . . . . . . . . . . . . . . . . . . . . . . . . . . 24
4.1.2 Dataset Description . . . . . . . . . . . . . . . . . . . . . . . . 25
4.2 Proposed Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
4.3 Feature Collection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
4.3.1 Linguistic Characteristics as Features . . . . . . . . . . . . . . 27
4.3.2 Genre Identification: POS Tagging as a Feature . . . . . . . . 27
4.3.3 Text Categorisation: N-gram as a Feature . . . . . . . . . . . 28
4.3.4 Sentiment as a Feature . . . . . . . . . . . . . . . . . . . . . . 28
5 Results 31
5.1 Linguistic Features Analysis . . . . . . . . . . . . . . . . . . . . . . . 31
5.2 POS Features Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . 32
5.3 N-gram Features Analysis . . . . . . . . . . . . . . . . . . . . . . . . 32
5.4 Sentiment Features Analysis . . . . . . . . . . . . . . . . . . . . . . . 33
5.5 Unified Features Model Analysis . . . . . . . . . . . . . . . . . . . . . 33
6 Unsupervised Method 35
6.1 Dataset Collection . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
6.2 Dataset Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
6.3 Feature Vector Generation . . . . . . . . . . . . . . . . . . . . . . . . 39
6.3.1 Review Centric Features . . . . . . . . . . . . . . . . . . . . . 40
6.3.2 Reviewer Centric Features . . . . . . . . . . . . . . . . . . . . 41
6.3.3 Product Centric Features . . . . . . . . . . . . . . . . . . . . . 41
6.4 Outlier Spam Detection using k-NN Method . . . . . . . . . . . . . . 41
6.5 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
Bibliography 46
List of Figures
List of Tables
Chapter 1
Introduction
Review Spam Detection has gathered a lot of attention in the last few years. Consumer review sites like Yelp.com have been removing spurious reviews from their websites using their own algorithms. Both supervised and unsupervised learning approaches have previously been used for the filtering
of Review Spams. For training the machine learning models, both linguistic and behavioural features have been used.
1. Hype spam, where undeserving positive reviews are written for a product in order to promote it
2. Defaming spam, where unreasonable negative reviews are given to competing products to harm their reputation among the consumers [3]
Specifically, reviews that have been written to popularize or benefit a brand or a product, and therefore express a positive sentiment towards it, are called positive deceptive review spams. In contrast, reviews that intend to malign or defame a competing product, expressing a negative sentiment towards it, are called negative deceptive review spams [4].
4. 31% of consumers read online reviews before actually making a purchase (rising)
5. By the end of 2014, 15% of all social media reviews will consist of company paid
fake reviews
Positively written reviews often bring a lot of profit and reputation to individuals and businesses. Sadly, this also provides an incentive for spammers to post fake or fabricated reviews and opinions. Through unwarranted positive reviews and unjustified negative reviews, opinion spamming has become a business in recent years. Surprisingly, a large number of consumers are completely unaware of such biased, paid or fake reviews.
1. To investigate some of the most novel techniques for Spam Detection in online reviews.
2. To build the most effective feature set for training the model.
3. To detect Review Spams using well-known classifiers on the labelled dataset.
Chapter 2
Literature Review
In Email Spam, direct mail messages are used to target individual users. The mailing lists for email spam are often prepared by scanning Usenet postings, searching the web for addresses, and stealing Internet mailing lists.
This type of spam makes use of instant messaging systems. Instant messaging is a form of chat-based direct communication between two people in real time, using personal computers or other devices, in which the network communicates messages in the form of text. This spam is very common on many instant messaging systems such as Skype.
Junk fax is a means of marketing via unsolicited advertisements that are sent through fax. Junk faxes are essentially the faxed equivalent of spam email and are a medium of telemarketing and advertising.
This type of spam (SMS spam) is hard to filter. Due to the low cost of internet access and the fast progress of technology, it is now very easy to send SMS spam in enormous volumes using the Internet's SMS portals. It is fast becoming a big challenge that needs to be overcome.
Social Networking spam is targeted at the regular users of social networking websites such as LinkedIn, Facebook, Google+ or MySpace. It often happens that users of these social networking services send one another direct messages or weblinks that contain embedded malicious or spam URLs pointing to other locations on the web. This is how a social spammer plays his role [5].
Type 1 (Untruthful reviews): These are deliberately fabricated reviews that either undeservedly promote a product or unjustly damage the reputation of a competing product.
Type 2 (Reviews with brand mentions): These spams have only brands as their prime focus. They comment on the manufacturer, the seller or the brand name alone. Such reviews are biased and can easily be figured out, as they do not talk about the product and only mention brand names.
Type 3 (Non-reviews): These reviews are either junk, i.e. have no relation to the product, or are purely used for advertisement purposes. They take two forms: advertisements, and irrelevant texts that contain no opinion about the product.
From Figure 2.1, we can infer that regions 1 and 4 are not very harmful. Regions 2 and 3 are very damaging for the reputation of a product. Regions 5 and 6 are mildly harmful, yet they can still bring about significant losses or profits for a brand or a product [7]. In this thesis, we have focussed on identifying the regions that are damaging for a product's reputation.
1. An individual spammer
Either only positive reviews are written about a product, or only negative reviews are written about the competitors' products.
2. A group of spammers
To control the sales of a product, the spammers write reviews around the launch time of the product.
Each spam group member writes reviews so that the overall deviation of the product rating is lowered.
They divide the group into sub-groups, and each sub-group works on a different website.
In 2010, Lim et al. [7] did an early work on detecting review spammers, proposing scoring techniques for the spamicity degree of each reviewer. The authors tested their model on Amazon reviews, which were initially taken through several data preprocessing steps. In this stage, they decided to keep only reviews from highly
active users, i.e. users who had written at least 3 reviews. The detection methods are based on several predefined abnormality indicators, such as general rating deviation, early deviation (i.e. how soon after a product appears on the website a suspicious user posts a review about it), and clusters of very high/low ratings. The feature weights were linearly combined into a spamicity formula and tuned empirically in order to maximize the value of the normalized discounted cumulative gain (NDCG) measure. The measure showed how well a particular ranking improves on the
overall goal. The training data was constructed as mentioned earlier from Amazon
reviews, which were manually labelled by human evaluators. Although an agreement
measure is used to compute the inter-evaluator agreement percentage, so that a
review is considered fake if all of the human evaluators agree, this method of manually
labelling deceptive reviews has been proven to lead to low accuracy when testing
on real-life fake review data. First, Ott et al. demonstrated that it is impossible
for humans to detect fake reviews simply by reading the text. Second, Mukherjee
et al. proved that not even fake reviews produced through crowdsourcing methods
are valid training data because the models do not generalize well on real-life test data.
Wang et al.[9] considered the triangular relationship among stores, reviewers and
their reviews. This was the first study to capture such relationships between these
concepts and study their implications. They introduced 3 measures meant to do
this: the stores reliability, the trustworthiness of the reviewers and the honesty of
the reviews. Each concept depends on the other two, in a circular way, i.e. a store is
more reliable when it contains honest reviews written by trustworthy reviewers and
so on for the other two concepts. They proposed a heterogeneous graph based model,
called the review graph, with 3 types of nodes, each type of node being characterized
by a spamicity score inferred using the other 2 types. In this way, they aimed to
capture much more information about stores, reviews and reviewers than just focus
on behavioural reviewer-centric features. This is also the first study on store reviews, which are different from product reviews. The authors argue that when looking at
product reviews, while it may be suspicious to have multiple reviews from the same
person for similar products, it is acceptable for the same person to buy multiple similar products from the same store and write a review each time about the experience. In almost all studies of fake product reviews that use cosine similarity as a measure of review content alikeness, a high value is considered a clear signal of cheating, since spammers do not spend much time writing new reviews but reuse the exact same words. However, when considering store reviews, it is possible for the same user to make valid purchases from similar stores, thus reusing the content of his older reviews and not writing completely different reviews every time.
Wang et al. used an iterative algorithm to rank the stores, reviewers and reviews
respectively, claiming that top rankers in each of the 3 categories are suspicious.
They evaluated the top 10 and bottom 10 ranked spammer reviewers using human evaluators and computed the inter-evaluator agreement. The evaluation of the resulting store reliability score, again for the top 10 and bottom 10 ranked stores, was done by comparison with store data from Better Business Bureaus, a corporation that keeps track of businesses' reliability and possible consumer scams.
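Several of the studies above treat a high cosine similarity between two reviews as a signal of content reuse. A minimal sketch of that measure over a simple bag-of-words representation (the exact preprocessing in the cited studies may differ):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def review_similarity(review_a, review_b):
    """Cosine similarity between two reviews using raw term counts."""
    vectors = CountVectorizer().fit_transform([review_a, review_b])
    return cosine_similarity(vectors[0], vectors[1])[0, 0]

# A value close to 1.0 suggests the second review largely reuses the wording
# of the first, which the studies above treat as suspicious for product
# reviews (though not necessarily for store reviews).
print(review_similarity(
    "Great phone, excellent battery life and a brilliant screen.",
    "Great phone with excellent battery life and brilliant screen."))
```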
Xie et al. [9] observed that the vast majority of reviewers (more than 90% in their study of resellerratings.com reviews up to 2010) only wrote one review, so they focused their research on this type of reviewer. They also claim, similarly to Feng et al. [10], that a flow of fake reviews coming from a hired spammer distorts the usual distribution of ratings for the product, leaving distributional traces behind. Xie et al. observed that the normal flow of reviews is not correlated with the given ratings over time. Fake reviews come in bursts of either very high ratings, i.e.
5-stars, or very low ratings, i.e. 1-star, so the authors aim to detect time windows
in which these abnormally correlated patterns appear. They considered the number
of reviews, average ratings and the ratio of singleton reviews which stick out when
looking over different time windows. The paper makes important contributions to
opinion spam detection by being the first study to date to formulate the singleton
spam review problem. Previous works have disregarded this aspect completely by
purging singleton reviews from their training datasets and focusing more on tracking highly active reviewers.
Feng et al.[10] published the first study to tackle the opinion spam as a
distributional anomaly problem, considering crawled data from Amazon and
TripAdvisor. They claim product reviews are characterized by natural distributions
which are distorted by hired spammers when writing fake reviews. Their contribution
consists of first introducing the notion of natural distribution of opinions and second
of conducting a range of experiments that finds a connection between distributional
anomalies and the time windows when deceptive reviews were written. For the
purpose of evaluation they used a gold standard dataset containing 400 known
deceptive reviews written by hired people, created by Ott et al. Their proposed
method achieves a maximum accuracy of only 72.5% on the test dataset and thus
is suitable as a technique to pinpoint suspicious activity within a time window and
draw attention to suspicious products or brands. However, this technique does not by itself represent a complete solution in which individual reviews can be deemed fake or truthful; it simply brings to the foreground short time windows where methods from other studies can be applied to detect spammers.
In 2011, Li et al. [11] used supervised learning and manually labelled reviews
crawled from Epinions to detect product review spam. They also added to the model
the helpfulness scores and comments the users associated with each review. Due
to the dataset size of about 60K reviews and the fact that manual labelling was
required, an important assumption was made - reviews that receive fewer helpful
votes from people are more suspicious. Based on this assumption, they have filtered
out review data accordingly, e.g. only considering reviews which have at least 5
helpfulness votes or comments. They achieved a 0.58 F-Score result using their
supervised method model, which outperformed the heuristic methods used at that
time to detect review spam. However, this result is very low when compared with
that of more recent review spam detection models. The main reason for this has
been the training of the model on manually labelled fake reviews data, as well as the
initial data pre-processing step where reviews were selected based on their helpfulness
votes. In 2013, Mukherjee et al. made the assumption that deceptive reviews get fewer helpfulness votes. But their model evaluation later showed that helpfulness votes not only perform poorly but may also be abused - groups of spammers working together to promote certain products may give many votes to each other's reviews. The same conclusion was also expressed by Lim et al. [7] in 2010.
Ott et al.[12] produced the first dataset of gold-standard deceptive opinion spam,
employing crowdsourcing through the Amazon Mechanical Turk. They demonstrated
that humans cannot distinguish fake reviews by simply reading the text, the results of
these experiments showing an at-chance probability. The authors found that although
part-of-speech n-gram features give a fairly good prediction on whether an individual
review is fake, the classifier actually performed slightly better when psycholinguistic
features were added to the model. The expectation was also that truthful reviews
resemble more of an informative writing style, while deceptive reviews are more
similar in genre to imaginative writing. The authors coupled the part-of-speech tags
in the review text which had the highest frequency distribution with the results
obtained from a text analysis tool previously used to analyze deception. Testing
their classifier against the gold-standard dataset, they revealed clue words deemed
as signs of deceptive writing. However, this can be seen as overly simplistic, as some
of these words, which according to the results have a higher probability to appear
in a fake review, such as vacation or family, may as well appear in truthful reviews.
The authors finally concluded that the domain context has an important role in the
feature selection process. Simply put, the imagination of spammers is limited - e.g.
in the case of hotel reviews, they tend to not be able to give spatial details regarding
their stay. While the classifier scored good results on the gold-standard dataset, once
the spammers learn about them, they could simply avoid using the particular clue
words, thus lowering the classifier's accuracy when applied to real-life data in the long term.
Mukherjee et al.[13] were the first to try to solve the problem of opinion spam
resulted from a group collaboration between multiple spammers. The method they
proposed first extracts candidate groups of users using a frequent itemset mining
technique. For each group, several individual and group behavioural indicators are
computed, e.g. the time differences between group members when posting, the rating
deviation between group members compared with the rest of the product reviewers,
the number of products the group members worked together on, or review content
similarities. The authors also built a dataset of fake reviews, with the help of human
judges which manually labelled a number of reviews. They experimented both with
learning to rank methods, i.e. ranking of groups based on their spamicity score
and with classification using SVM and logistic regression, using the labelled review
data for training. The algorithm, called GSRank, considerably outperformed existing methods, achieving an area under the curve (AUC) of 95%. This score makes it a very strong candidate for production environments where the community of users is very active and each user writes more than one review. However, not many users write a lot of reviews; there exists only a relatively small percentage of "elite" contributing users. So this method would best be coupled with a method for detecting singleton reviewers, such as the method of Xie et al. [9] discussed above.
They tested Ott's model on Yelp data. This led the authors to claim that any previous
model trained using reviews collected through the AMT tool can only offer near
chance accuracy and is useless when applied on real-life data. However, the authors
do not rule out the effectiveness of using n-gram features in the model, and they showed that the highest accuracy obtained on Yelp data was achieved using a combination of behavioural and linguistic features. Their experiments show little improvement in accuracy when adding n-gram features. Probably the most interesting conclusion
is that behavioural features considerably outperform n-gram features alone.
Mukherjee et al. built an unsupervised model called the Author Spamicity Model
that aims to split the users into two clusters - truthful users and spammers. The
intuition is that the two types of users are naturally separable due to the behavioural
footprints left behind when writing reviews. The authors studied the distributional
divergence between the two types and tested their model on real-life Amazon reviews.
Most of the behavioural features in the model have been previously used in two earlier studies, by Mukherjee et al. in 2012 and Mukherjee et al. in 2013. In these
studies though, the model was trained using supervised learning. The novelty about
the proposed method in this paper is a posterior density analysis of each of the
features used. This analysis is meant to validate the relevance of each model feature
and also increase the knowledge on their expected values for truthful and fake reviews
respectively.
Fei et al.[15] focused on detecting spammers that write reviews in short bursts.
They represented the reviewers and the relationships between them in a graph and
used a graph propagation method to classify reviewers as spammers. Classification
was done using supervised learning, by employing human evaluation of the identified
honest/deceptive reviewers. The authors relied on behavioural features to detect
periods in time when review bursts per product coincided with reviewer bursts, i.e. a reviewer is unusually prolific at the same time as a particular product receives a higher than average number of reviews. The authors
discarded singleton reviewers from the initial dataset, since these provide little
behaviour information - all the model features used in the burst detection model
require extensive reviewing history for each user. By discarding singleton reviewers,
this method is similar to the one proposed by Mukherjee et al. in 2012. These
methods can thus only detect fake reviews written by elite users on a review platform.
Exploiting review posting bursts is an intuitive way to obtain smaller time windows
where suspicious activity occurs. This can be seen as a way to break the fake review
detection method into smaller chunks and employ other methods which then have to work with considerably fewer data points. This would decrease the computational and time
complexity of the detection algorithm.
2.5 Spam Detection Methods
Detect rating and content outliers (reviews whose ratings differ greatly from the average product rating).
Compare the review ratings given by the same reviewer on products from various other stores.
Chapter 3
Supervised Method
The linguistic and functional properties of text (such as its complexity, average number of words per sentence, number of digits, etc.) are important features to be incorporated for review spam classification.
Deceptive reviews contain more words, i.e. greater quantity. The complexity of deceptive reviews is found to be greater than that of truthful reviews. Truthful reviews tend to have more unique words (greater diversity) than deceptive reviews, where the spammers have little knowledge about the product. Brand names are mentioned more frequently in deceptive reviews than in truthful ones. Average word length is greater in truthful reviews. The number of digits mentioned in truthful reviews is higher than in deceptive ones, as a reviewer writing a truthful review has more information about the product and hence mentions more digits [3].
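A minimal sketch of how such linguistic cues could be computed for a single review; the brand list and the simple tokenisation below are illustrative assumptions rather than the thesis's final feature definitions:

```python
import re

def linguistic_features(review, brand_names=("Hilton", "Hyatt")):
    """Simple linguistic cues for one review: quantity, diversity,
    average word length, digit count and brand mentions.

    brand_names is a placeholder; in practice it would come from the
    product or hotel metadata.
    """
    words = re.findall(r"[A-Za-z']+", review)
    lower = [w.lower() for w in words]
    return {
        "num_words": len(words),                               # quantity
        "num_unique_words": len(set(lower)),                   # diversity
        "avg_word_length": sum(len(w) for w in words) / max(len(words), 1),
        "num_digits": len(re.findall(r"\d", review)),
        "brand_mentions": sum(lower.count(b.lower()) for b in brand_names),
    }

print(linguistic_features("Stayed 2 nights at the Hilton, room 1204 was spotless."))
```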
The distribution of part-of-speech (POS) tag counts in a text depicts its genre. Strong linguistic differences have been found between imaginative and informative writing, as depicted in the work of Rayson et al. in 2001. Informative texts contain more nouns, prepositions, adjectives, determiners and coordinating conjunctions, while imaginative texts contain more pronouns, verbs, adverbs and pre-determiners.
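A small sketch of obtaining such a POS tag frequency distribution with NLTK (assuming its tokenizer and tagger resources have been downloaded); the thesis does not prescribe a particular tagger, so this is illustrative only:

```python
from collections import Counter

import nltk  # the punkt tokenizer and perceptron tagger models must be downloaded first

def pos_distribution(review):
    """Relative frequency of each POS tag in a review."""
    tags = [tag for _, tag in nltk.pos_tag(nltk.word_tokenize(review))]
    counts = Counter(tags)
    total = sum(counts.values())
    return {tag: count / total for tag, count in counts.items()}

# Informative writing tends to show more noun/preposition/adjective tags,
# imaginative writing more pronoun/verb/adverb tags, per Rayson et al.
print(pos_distribution("The room was spacious and the staff were very friendly."))
```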
N-grams as features help us model the entire content as well as its context using the text categorisation method. Thus, we consider UNIGRAMS and BIGRAMS in our N-gram feature sets.
Standard techniques for N-gram text categorisation have been used to locate Deceptive Review Spams with an approximate accuracy of about 86%.
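A minimal sketch of building combined unigram and bigram count features with scikit-learn; the parameter choices are illustrative assumptions, not the exact configuration used in the experiments:

```python
from sklearn.feature_extraction.text import CountVectorizer

reviews = [
    "The hotel was wonderful and the staff friendly.",
    "Wonderful hotel, friendly staff, great location.",
]

# ngram_range=(1, 2) extracts both unigrams and bigrams, matching the
# feature sets described above; vocabulary pruning (min_df, max_features)
# would normally be tuned on the training data.
vectorizer = CountVectorizer(ngram_range=(1, 2), lowercase=True)
ngram_matrix = vectorizer.fit_transform(reviews)

print(ngram_matrix.shape)                        # (num_reviews, num_ngram_features)
print(vectorizer.get_feature_names_out()[:10])   # first few extracted n-grams
```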
Fake negative reviewers are seen to over-produce terms depicting negative emotions (e.g., horrible, disappointed, etc.) as compared to truthful reviews. Similarly, fictitious positive reviewers over-produce terms depicting positive emotions (e.g., beautiful, elegant, etc.). Therefore, fake hotel reviewers exaggerate the sentiment.
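A toy sketch of this intuition: measure the share of positive and negative opinion terms in a review. The tiny word lists below merely stand in for a full opinion lexicon:

```python
POSITIVE = {"beautiful", "elegant", "wonderful", "great", "excellent"}
NEGATIVE = {"horrible", "disappointed", "terrible", "dirty", "rude"}

def sentiment_term_ratios(review):
    """Fraction of tokens that are positive / negative opinion words."""
    tokens = [t.strip(".,!?") for t in review.lower().split()]
    total = max(len(tokens), 1)
    pos = sum(1 for t in tokens if t in POSITIVE)
    neg = sum(1 for t in tokens if t in NEGATIVE)
    return pos / total, neg / total

# An unusually high ratio (exaggerated sentiment) is treated as a spam cue.
print(sentiment_term_ratios("Beautiful, elegant hotel with wonderful staff!"))
```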
3.2 Classifiers
Features from the four approaches just introduced (linguistic characteristics, POS tags, sentiment polarity and n-grams) are used to train classifiers such as Naive Bayes, Decision Tree and Support Vector Machine (SVM).
Based on Bayes' theorem, the Naive Bayesian classifier makes an independence assumption among the different predictors. It is an easy model to build, with no complicated parameter estimation, and can thus be used easily on very large datasets. Even though this model is highly simplistic, the Naive
Bayesian classifier performs surprisingly well in practice and can even outperform more complicated and sophisticated classification models.
Algorithm:
Using Bayes' theorem, we ultimately calculate the posterior probability P(c|x) from P(c), P(x) and P(x|c). Here, P(x) is the prior probability of the predictor, P(x|c) denotes the likelihood and P(c) is the class prior probability.
This classifier works on the assumption that the value of a feature x for a given class is independent of the values of the other features. We call this assumption class conditional independence.
P(c|x) = P(x|c) P(c) / P(x)
where,
P(c|x): posterior probability of a class given the attributes
P(c): prior probability of a class
P(x|c): likelihood, i.e. the probability of the predictor given a particular class
P(x): prior probability of the predictor
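A minimal sketch of training a Naive Bayes classifier on n-gram counts with scikit-learn, assuming a small labelled set of reviews (1 = deceptive, 0 = truthful); it is illustrative only and not the exact pipeline used in the experiments:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

train_reviews = [
    "Absolutely wonderful hotel, best stay of my life, amazing!",
    "Room 412 was clean, check-in took 10 minutes, decent breakfast.",
    "Horrible, horrible place, never ever stay here, worst hotel!",
    "The gym was small but the bed was comfortable and quiet.",
]
train_labels = [1, 0, 1, 0]  # assumed labels: 1 = deceptive/spam, 0 = truthful

vectorizer = CountVectorizer(ngram_range=(1, 2))
X_train = vectorizer.fit_transform(train_reviews)

clf = MultinomialNB()   # applies Bayes' rule under the independence assumption
clf.fit(X_train, train_labels)

X_test = vectorizer.transform(["Amazing amazing hotel, simply the best ever!"])
print(clf.predict(X_test), clf.predict_proba(X_test))
```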
Advantages:
Disadvantages:
3.2.2 Support Vector Machine
If the separating line passes too close to the points of a class, it will generalise badly and be sensitive to noise, and will thus be incorrect. Thus, our objective is to obtain a straight line that is as far as possible from the points of both classes while dividing them.
The goal of our SVM classifier is to find the hyperplane that gives the farthest minimum distance to the training points of either class. In SVM theory, the "margin" is defined as twice this separating distance. Finally, the hyperplane that maximizes this margin is chosen as the decision boundary.
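A brief illustrative sketch of fitting such a maximum-margin linear classifier with scikit-learn; the two-dimensional feature values and labels below are invented for illustration:

```python
import numpy as np
from sklearn.svm import LinearSVC

# Toy 2-D feature vectors (e.g. [positive-term ratio, brand mentions]);
# labels: 1 = spam, 0 = genuine.  Purely illustrative values.
X = np.array([[0.40, 3], [0.05, 0], [0.35, 2], [0.02, 1]])
y = np.array([1, 0, 1, 0])

svm = LinearSVC(C=1.0)   # C trades margin width against training errors
svm.fit(X, y)

# The learned hyperplane w.x + b = 0 separates the two classes with the
# widest attainable margin on this training data.
print(svm.coef_, svm.intercept_, svm.predict([[0.30, 2]]))
```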
3.2.3 Decision Tree
Entropy:
A top-down approach is used to build the decision tree, starting from the root node. Data is partitioned into smaller sets having homogeneous values. The ID3 algorithm uses entropy to compute the homogeneity of the given data. Entropy becomes zero if the data is entirely homogeneous; if the data is divided equally between the classes, entropy becomes one.
Information Gain:
The decrease in entropy after splitting the dataset on a feature is the information gain. We try to build the decision tree by choosing features that yield the maximum information gain, i.e. the most homogeneous branches.
Step 1: The entropy of the target is calculated.
Step 2: We split the dataset on each of the feature attributes and calculate the entropy of every branch. The total entropy of the split is obtained by proportional addition and is then subtracted from the pre-split entropy. The resulting value is the information gain, i.e. the decrease in entropy.
Step 3: We choose the feature that gives the maximum information gain and make it our decision node.
Step 4a: If entropy = 0, the branch becomes a leaf node.
Step 4b: If entropy > 0, further splitting needs to be done.
Step 5: We run the ID3 algorithm recursively on the decision branches until all the data is classified. A small worked sketch of this computation follows.
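The entropy and information-gain computation behind these steps can be sketched as follows; the toy features and labels are assumptions chosen only to illustrate the arithmetic:

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())

def information_gain(rows, labels, feature_index):
    """Entropy reduction obtained by splitting on one feature."""
    splits = {}
    for row, label in zip(rows, labels):
        splits.setdefault(row[feature_index], []).append(label)
    remainder = sum(len(subset) / len(labels) * entropy(subset)
                    for subset in splits.values())
    return entropy(labels) - remainder

# Toy data: feature 0 = "posted in a burst?", feature 1 = "brand mentioned?"
rows = [("yes", "yes"), ("yes", "no"), ("no", "no"), ("no", "yes")]
labels = ["spam", "spam", "ham", "ham"]
print(information_gain(rows, labels, 0))  # 1.0: splits the classes perfectly
print(information_gain(rows, labels, 1))  # 0.0: carries no information
```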
Chapter 4
Proposed Work
The area of review spam detection has very few labelled datasets available. Most of these labellings have either been done manually or are the result of heuristics. We can, however, obtain our dataset from the various sources mentioned below [12]:
Ratings between 1 and 5 can be given by users, after which they can write descriptively about a product or service. Users can also check in, just as on Tripadvisor.com, to a restaurant, hotel or location that they are visiting. Yelp gets about 132 million visitors on a monthly basis and hosts a total of about half a billion reviews.
Although Yelp does not give its dataset away to the public, user information and reviews can be scraped from the website. Bots and scripts can be used to scrape the data, as the pages are served with few restrictions in order to gain more penetration in search engine results.
We compiled a collection of 1600 reviews in total from the sources mentioned above. These reviews were for 20 Chicago-based hotels. Each review has the following features:
5. The binary label depicting whether the review is spam or not
The corpus contains 80 reviews for each of the 20 Chicago-based hotels: Affinia, Amalfi, Allegro, Ambassador, Fairmont, Conrad, Hard Rock, Homewood, Hilton, James, Monaco, Hyatt, Intercontinental, Knickerbocker, Omni, Sheraton, Palmer, Sofitel, Talbott and Swissotel. These 80 reviews comprise 40 spam and 40 non-spam reviews, and each set of 40 contains 20 positive and 20 negative reviews [12].
This dataset is useful for our research for the following reasons:
1. Our data has an equal number of reviews for each hotel, and thus it is a well-balanced dataset.
2. Class imbalance does not exist, as we have an equal number of spam and non-spam reviews, each with a balanced number of positive and negative reviews.
3. While obtaining data from the Amazon Mechanical Turk, the Turkers were asked to write reviews in such a fashion that they seem genuine and would easily be accepted as good, acceptable reviews by the website.
4. In the process, the AMT Turkers could also view other reviews already written about the same hotel. Writing a review similar to the earlier ones made the task of the AMT workers much simpler, and it also increased their knowledge base for writing further genuine-sounding reviews.
5. To ensure the authenticity of the genuine reviews collected, reviews with non-5-star ratings were eliminated.
6. Reviews that were too short or too long were removed.
Steps:
2. We maintain a dictionary of the unigram and bigram features obtained from the training dataset.
3. Each review from the test set is then split into its corresponding N-grams, and the corresponding score of each N-gram is looked up.
5. Finally, we compare the total scores to determine whether the test review is more similar to the spam set or to the non-spam set, and thereby decide whether it is genuine or fake. A minimal sketch of this scoring follows.
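A minimal sketch of the scoring idea described in these steps, with assumed toy score dictionaries; in practice the scores would be derived from the n-gram statistics of the labelled training set:

```python
def extract_ngrams(text, n_values=(1, 2)):
    """Split a review into lower-cased unigrams and bigrams."""
    tokens = [t.strip(".,!?") for t in text.lower().split()]
    ngrams = []
    for n in n_values:
        ngrams += [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return ngrams

# Assumed scores learned from the training data: how strongly each n-gram
# is associated with the spam and the non-spam class respectively.
spam_scores = {"my husband": 2.0, "amazing": 1.5, "best ever": 2.5}
ham_scores = {"room 412": 2.0, "check in": 1.0, "breakfast": 1.5}

def classify_review(text):
    """Compare total spam vs. non-spam n-gram scores for a test review."""
    grams = extract_ngrams(text)
    spam_total = sum(spam_scores.get(g, 0.0) for g in grams)
    ham_total = sum(ham_scores.get(g, 0.0) for g in grams)
    return ("spam" if spam_total > ham_total else "genuine", spam_total, ham_total)

print(classify_review("My husband said it was the best ever hotel, amazing!"))
```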
Deceptive positive reviews have been found to contain a greater percentage of words expressing positive sentiment than genuine positive reviews. Similarly, deceptive negative reviews contain more negative terms than genuine negative reviews [2][4].
Steps:
3. The strength of a sentiment word with respect to a feature decreases with its distance from the feature word.
5. Finally, the aggregation of all feature scores and then their mean gives us the sentiment score in the range [-1, +1].
score(f, r) = Σ_j (-1)^cn · o(wj) / dist(wj, f)
Here,
r = review
f = aspect/feature in a sentence
o(wj): sentiment polarity of word wj (+1 or -1)
cn: number of negation words around the feature, default = 0
dist(wj, f): distance between feature f and word wj.
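A small sketch of computing this distance-weighted, negation-aware score for a single aspect, following the symbols defined above; the opinion lexicon, negation list and the final clamping are illustrative assumptions:

```python
OPINION = {"great": +1, "comfortable": +1, "dirty": -1, "noisy": -1}
NEGATIONS = {"not", "never", "no"}

def aspect_sentiment(tokens, feature_index):
    """Distance-weighted sentiment of opinion words around one aspect word.

    Each opinion word w_j contributes o(w_j) * (-1)**c_n / dist(w_j, f),
    where c_n counts negation words occurring near the aspect.
    """
    c_n = sum(1 for t in tokens if t in NEGATIONS)
    score = 0.0
    for j, token in enumerate(tokens):
        if token in OPINION and j != feature_index:
            score += OPINION[token] * ((-1) ** c_n) / abs(j - feature_index)
    return score

def review_sentiment(review, aspects):
    """Mean of per-aspect scores, clamped into [-1, +1]."""
    tokens = [t.strip(".,!?").lower() for t in review.split()]
    scores = [aspect_sentiment(tokens, tokens.index(a)) for a in aspects if a in tokens]
    mean = sum(scores) / len(scores) if scores else 0.0
    return max(-1.0, min(1.0, mean))

print(review_sentiment("The room was not comfortable and quite noisy.", ["room"]))
```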
Chapter 5
Results
The linguistic features analysis performs moderately well; the results obtained are presented in Table 5.1. We observe, however, that this analysis is comparable to the manual classification of the same dataset by human annotators. Ott et al. found that humans achieve an accuracy of less than 60% on the same classification task, and when multiple groups were asked to classify the dataset, the agreement between their results was quite low. Thus, our linguistic features model is in tune with human intuition in deceptive review detection.
2. We find that spammers and non-spammers may use similar words, but the frequency with which words from each word-set are used makes a huge difference.
3. The N-gram model can be applied in all types of scenarios, not just hotel reviews, as the basic idea remains the same and the method works well on all types of datasets.
Chapter 6
Unsupervised Method
The dataset had many entries that did not have user ids. Such entries were removed to maintain consistency in the analysis.
This dataset provides more information than the previous one: apart from the review data, it also provides the reviewer's as well as the product's information.
3. Product rating vs. percentage of reviews: 60.77% of reviews have a rating of 4 or above.
The Similarity score is the percentage of similar words used between a pair of reviews. The Similarity score, S, is measured out of 100, where a score of one hundred means the reviews are identical. The plot maps around 10,000 review pairs against their Similarity scores. It shows a peak in the middle range of values, which is to be expected, and beyond a score of 60 it tapers closely towards the x-axis. On analysis, around 0.5% of the total review pairs have a Similarity score of more than 70. However small this percentage may look, it means that around 5500 reviews are near-identical copies of previously existing reviews, which in itself is a large number given that the number of products is around 7500.
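A minimal sketch of a word-overlap similarity score of this kind, measured out of 100; the thesis does not give the exact formula, so the overlap ratio below is an assumption:

```python
def similarity_score(review_a, review_b):
    """Percentage of shared words between two reviews (0 to 100)."""
    words_a = set(review_a.lower().split())
    words_b = set(review_b.lower().split())
    if not words_a or not words_b:
        return 0.0
    return 100.0 * len(words_a & words_b) / len(words_a | words_b)

# Near-identical copies of earlier reviews score close to 100.
print(similarity_score(
    "great product fast shipping highly recommended",
    "great product fast shipping highly recommended to everyone"))
```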
1. Review Centric
2. Reviewer Centric
3. Product Centric
As the names suggest, review centric features comprise attributes related purely to the review text, reviewer centric features contain attributes related to the reviewers, and product centric features contain information about the product.
Textual features:
F10. Percentage of positive opinion-bearing words, e.g. "beautiful", "great", etc.
F11. Percentage of negative opinion-bearing words, e.g. "bad", "poor", etc.
F12. Percentage of numerals used
F13. Number of capital letters used
F14. Number of all-capital words in the review text
F19. Ratio of number of reviews written by a reviewer which were first reviews
F20. Ratio of number of times he/she was the only reviewer
F21. Average rating given by a reviewer
F22. Standard deviation in rating given by reviewer
Figure 6.6: A snapshot of the features extracted from the Amazon dataset
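Section 6.4 applies a k-NN based outlier detection over feature vectors such as these. A minimal sketch of one common variant, flagging the points with the largest average distance to their k nearest neighbours (the value of k and the toy feature values are assumptions):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def knn_outlier_scores(feature_matrix, k=5):
    """Average distance of each point to its k nearest neighbours.

    Rows with unusually large scores behave differently from the bulk of
    reviews/reviewers and are candidate spam outliers.
    """
    nn = NearestNeighbors(n_neighbors=k + 1).fit(feature_matrix)  # +1: the point itself
    distances, _ = nn.kneighbors(feature_matrix)
    return distances[:, 1:].mean(axis=1)          # drop the zero self-distance

# Toy feature vectors (e.g. F21 average rating, F22 rating std-dev).
X = np.array([[4.1, 0.5], [4.0, 0.6], [4.2, 0.4], [4.1, 0.5], [1.0, 0.0]])
scores = knn_outlier_scores(X, k=2)
print(scores)                         # the last row stands out
print(int(np.argmax(scores)))         # index of the strongest outlier
```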
6.5 Results
Chapter 7
Conclusion and Future Work
Bibliography
[1] Kyung-Hyan Yoo and Ulrike Gretzel. Comparison of deceptive and truthful travel reviews.
Information and communication technologies in tourism 2009, pages 37–47, 2009.
[2] Qingxi Peng and Ming Zhong. Detecting spam review through sentiment analysis. Journal of
Software, 9(8):2065–2072, 2014.
[3] C Harris. Detecting deceptive opinion spam using human computation. In Workshops at AAAI
on Artificial Intelligence, 2012.
[4] Myle Ott, Claire Cardie, and Jeffrey T Hancock. Negative deceptive opinion spam. In
HLT-NAACL, pages 497–501, 2013.
[5] M Daiyan, SK Tiwari, and MA Alam. Mining product reviews for spam detection using
supervised.
[6] Nitin Jindal and Bing Liu. Opinion spam and analysis. In Proceedings of the 2008 International
Conference on Web Search and Data Mining, pages 219–230. ACM, 2008.
[7] Ee-Peng Lim, Viet-An Nguyen, Nitin Jindal, Bing Liu, and Hady Wirawan Lauw. Detecting
product review spammers using rating behaviors. In Proceedings of the 19th ACM international
conference on Information and knowledge management, pages 939–948. ACM, 2010.
[8] Manali S Patil and AM Bagade. Online review spam detection using language model and feature selection. International Journal of Computer Applications, 2012.
[9] Sihong Xie, Guan Wang, Shuyang Lin, and Philip S Yu. Review spam detection via temporal
pattern discovery. In Proceedings of the 18th ACM SIGKDD international conference on
Knowledge discovery and data mining, pages 823–831. ACM, 2012.
[10] Song Feng, Longfei Xing, Anupam Gogar, and Yejin Choi. Distributional footprints of deceptive
product reviews. In ICWSM, 2012.
[11] Fangtao Li, Minlie Huang, Yi Yang, and Xiaoyan Zhu. Learning to identify review spam. In
IJCAI Proceedings-International Joint Conference on Artificial Intelligence, volume 22, page
2488, 2011.
[12] Myle Ott, Yejin Choi, Claire Cardie, and Jeffrey T Hancock. Finding deceptive opinion
spam by any stretch of the imagination. In Proceedings of the 49th Annual Meeting of the
Association for Computational Linguistics: Human Language Technologies-Volume 1, pages
309–319. Association for Computational Linguistics, 2011.
[13] Arjun Mukherjee, Bing Liu, and Natalie Glance. Spotting fake reviewer groups in consumer
reviews. In Proceedings of the 21st international conference on World Wide Web, pages 191–200.
ACM, 2012.
[14] Arjun Mukherjee, Abhinav Kumar, Bing Liu, Junhui Wang, Meichun Hsu, Malu Castellanos,
and Riddhiman Ghosh. Spotting opinion spammers using behavioral footprints. In Proceedings
of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining,
pages 632–640. ACM, 2013.
[15] Geli Fei, Arjun Mukherjee, Bing Liu, Meichun Hsu, Malu Castellanos, and Riddhiman Ghosh.
Exploiting burstiness in reviews for review spammer detection. In ICWSM. Citeseer, 2013.
[16] Xia Hu, Jiliang Tang, Huiji Gao, and Huan Liu. Social spammer detection with sentiment
information.