Blog and Social Media Mining
What is a weblog / blog?
a (more or less) frequently updated
publication on the Web, sorted in (usually
reverse) chronological order of the constituent
blog posts.
The content may reflect any interest
including personal, journalistic or corporate.
Usually textual, but multimedia forms exist
(photoblog etc.)
Blog as an emerging new data
An Example of Blog Article
Location Info.
Blog Contents
The time stamp
Characteristics of blogs
Blog Article
Interlinking &
Forming communities
Highly personal
With opinions
Time
Location
Immediate response
to events
With mixed topics
Associated with time
& location
Blogs are Bursty
Blogs contain Theme Patterns
Theme life cycles
[Figure: discussion about "Release of iPod Nano" in articles about iPod Nano; axes: Strength, Time; series: United States, China, Canada]
A theme snapshot
[Figure: discussion about "Government Response" in articles about Hurricane Katrina; axis: Locations; period: 09/20/05 to 09/26/05]
Existing Work on Weblog Analysis
Interlinking and Community
Analysis
Identifying communities
Monitoring the evolution and
bursting of communities
E.g., [Kumar et al. 2003]
# of nodes in communities
# of communities
Content Analysis
Blog level topic analysis
Information diffusion through
blogspace
Use topic bursting to predict
sales spikes
E.g., [Gruhl et al. 2005]
Blog mentions
Sales rank
Applications of Blog Theme Mining
Help answer questions like
Which country responded first to the release of iPod
Nano? China, UK, or Canada?
Did people in different states (e.g., Illinois vs. Texas)
respond differently/similarly to the increase of commodity
prices during Hurricane Katrina?
Potentially useful for
Summarizing topics
Monitoring public opinions
Business Intelligence
Consumers are more informed,
more demanding than ever
92% of respondents said they had more confidence in information they
seek out online than anything coming from a salesclerk or other source
That's where our best consumers are!
A Large and Growing Population
Over 184 million people currently maintain a blog / are active in Soc. Media
...about 20% of the Internet population
Over 60% read posts in blogs / Soc. Media
Trend Setters
74% with college degrees
42% have post-graduate degrees
That are More Diverse
58% over age of 35
51% of household incomes > $75K
Sharing Opinions and Ideas
Unaided, natural conversations
Rich in content
In Real Time
There are over a million new blog / Soc. Media posts every day
Source: Technorati, State of the Blogosphere, 2008 and Pew Research.
Analysis of text content has many
applications from tactical to strategic
Web
Intelligence
Product Innovation
Consumer Insight
Trend Insight
Brand Insight
Blogger Outreach
Buzz &
Sentiment
Crisis Communication
Tactical
Strategic
If you have a crisis, definitely
use it to listen and reach out
June 21, 2005
Dell lies. Dell sucks.
I just got a new Dell laptop and paid a fortune for the four-year, in-home
service.
The machine is a lemon and the service is a lie.
I'm having all kinds of trouble with the hardware: overheats, network
doesn't work, maxes out on CPU usage. It's a lemon.
But what really irks me is that they say if they sent someone to my
home -- which I paid for -- he wouldn't have the parts, so I might as
well just send the machine in and lose it for 7-10 days -- plus the time
going through this crap. So I have this new machine and paid for
them to FUCKING FIX IT IN MY HOUSE and they don't and I lose it
for two weeks.
DELL SUCKS. DELL LIES. Put that in your Google and smoke it,
Facts and Opinions
Two main types of textual information:
facts and opinions
Most current information processing
techniques (e.g., search engines) work
with facts (assume they are true)
Facts can be expressed with topic
keywords
Opinions
In real life, facts are important, but opinions also
play a crucial role. A computer manufacturer,
disappointed with low sales, asks itself: "Why
aren't consumers buying our laptop?" A political
party, disappointed with the last election, wants
to know on an ongoing basis: "What is the
reaction in the press, newsgroups, chat rooms,
and blogs to the latest policy decisions?"
Opinions in posts
Analysis of Posts (Tasks)
Perform subjectivity and polarity classification
on blog posts
Discover irregularities in temporal mood
patterns (fear, excitement, etc) appearing in a
large corpus of posts
Use link polarity information to model trust
and influence in the blogosphere
Analyze sentiments about products and
correlate it with its sales
Challenges
Determine whether a document or portion
(e.g. paragraph or statement) is
subjective.
Example: "the battery lasts 2 hours" vs.
"the battery lasts only 2 hours"
Challenges
The difficulty lies in the richness of human
language use.
Example:
1. This is a great camera.
2. A great amount of money was spent for
promoting this camera.
3. One might think this is a great camera.
Well think again, because.....
a single keyword can be used to convey three
different opinions, +ve, neutral and -ve
respectively.
Challenges
In order to arrive at sensible conclusions,
sentiment analysis has to understand
context. For example, "fighting" and
"disease" are negative in a war context but
positive in a medical one.
Different mining conditions for different
domains.
Sentiment Classification
There are two main techniques for
sentiment classification:
The symbolic technique uses manually
crafted rules and lexicons,
The machine learning approach uses
unsupervised or supervised learning to
construct a model from a large training
corpus.
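The symbolic technique can be illustrated with a minimal sketch. The tiny lexicon below reuses example words from these slides; the function name `lexicon_polarity`, the negator list and the single negation rule are assumptions made for illustration, not part of any particular system.

```python
import re

# Tiny hand-built lexicon; the words come from examples in these slides.
POSITIVE = {"honest", "mature", "patient", "praise", "love",
            "pleasure", "enjoyment", "great"}
NEGATIVE = {"harmful", "hypocritical", "inefficient", "insecure",
            "blame", "criticize", "pain", "criticism"}
NEGATORS = {"not", "never", "no"}  # assumed negation cues


def lexicon_polarity(text):
    """Return 'positive', 'negative' or 'neutral' from lexicon hits.

    A hit directly preceded by a negator has its sign flipped.
    """
    tokens = re.findall(r"[a-z']+", text.lower())
    score = 0
    for i, tok in enumerate(tokens):
        sign = 1 if tok in POSITIVE else -1 if tok in NEGATIVE else 0
        if sign and i > 0 and tokens[i - 1] in NEGATORS:
            sign = -sign
        score += sign
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


print(lexicon_polarity("Ron Paul is the only honest man in Washington."))
print(lexicon_polarity("Why are they being so inefficient?"))
print(lexicon_polarity("A great amount of money was spent for promoting this camera."))
```

The last call reports "positive" although the sentence is neutral, which is the keyword ambiguity described under Challenges.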
Subjectivity
Find relevant words, phrases, patterns that
can be used to express subjectivity
Determine the polarity of subjective
expressions
Words
Adjectives
positive: honest important mature large patient
Ron Paul is the only honest man in Washington.
Kitchell's writing is unbelievably mature and is only likely to
get better.
To humour me my patient father agrees yet again to my
choice of film
negative: harmful hypocritical inefficient insecure
It was a macabre and hypocritical circus.
Why are they being so inefficient ?
Words
Verbs
positive: praise, love
negative: blame, criticize
Nouns
positive: pleasure, enjoyment
negative: pain, criticism
Phrases
Phrases containing adjectives and
adverbs
positive: high intelligence, low cost, better
performance
negative: little variation, many troubles,
several excuses
Supervised Methods
In order to train a classifier for sentiment
recognition in text, classic supervised learning
techniques (e.g. Support Vector Machines, naive
Bayes, Maximum Entropy) can be used. A
supervised approach entails the use of a
labelled training corpus to learn a certain
classification function. Support Vector Machine
classifiers have often been reported to achieve
the highest accuracy among these.
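As a minimal sketch of the supervised approach, the block below trains a multinomial naive Bayes classifier (one of the techniques listed above) with add-one smoothing, in plain Python. The function names and the four training sentences are made up for illustration; a real study would use a large labelled corpus and possibly a different classifier such as an SVM.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(labelled_docs):
    """Train a multinomial naive Bayes model from (text, label) pairs."""
    word_counts = defaultdict(Counter)  # label -> word -> count
    doc_counts = Counter()              # label -> number of documents
    vocab = set()
    for text, label in labelled_docs:
        words = text.lower().split()
        word_counts[label].update(words)
        doc_counts[label] += 1
        vocab.update(words)
    return word_counts, doc_counts, vocab


def classify(text, model):
    """Return the label with the highest log posterior (add-one smoothing)."""
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    best_label, best_score = None, -math.inf
    for label in doc_counts:
        score = math.log(doc_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            if word in vocab:  # ignore words never seen in training
                score += math.log(
                    (word_counts[label][word] + 1) / (total_words + len(vocab))
                )
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Made-up training sentences, for illustration only.
training = [
    ("clean room and great food", "positive"),
    ("nice staff and good location", "positive"),
    ("dirty room and bad smell", "negative"),
    ("poor service and cold food", "negative"),
]
model = train_naive_bayes(training)
print(classify("great location and clean room", model))
print(classify("bad service and cold room", model))
```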
Unsupervised Learning
Clustering algorithms can be used to partition the
adjectives into two subsets
Example adjectives to split into positive (+) and negative (-) subsets:
slow, scenic, nice, terrible, handsome, painful, fun, expensive, comfortable
Applications / Caselets
Sentiment Analysis for Mining
Marketing Intelligence
Sentiment Analysis for Mining Marketing Intelligence
This case study demonstrates the application of sentiment
analysis and opinion mining for extracting marketing
intelligence from online reviews
Different studies have confirmed the importance of online
reviews for consumers and product manufacturers
Users' opinions expressed in reviews help
potential consumers make well-informed purchase
decisions
Product manufacturers need the same opinions to gain
insights about their products' strengths and weaknesses, and
to collect product benchmarking information
Marketing Intelligence
MI is the process of acquiring and analyzing information in
order to understand the market (both existing and potential
customers); to determine the current and future needs and
preferences, attitudes and behavior (Cornish, 1997)
In consonance with Cornish's definition, we take the view that
consumer sentiments and opinions can be useful for elicitation
of their preferences
Traditional Methods for Collecting Consumer
Preference
Typically, consumer preferences are estimated by means of
conjoint analysis of data from online or paper-and-pencil
surveys
However, this type of preference elicitation can easily become
expensive in terms of time and money
Moreover, the quality of the data resulting from surveys
depends on the willingness of the respondents to participate
in the study and the length (complexity) of the questionnaire
Data collection methods such as opinion polls, field interview
or purchasing costly point-of-sale data are found to be
expensive and time consuming
Objective
To discover marketing intelligence like Feature Buzz related to
products and to analyze feature level opinion by sentiment
analysis and opinion mining
Developing novel approaches for analysis of opinionated text
information by bridging the gap among text mining, machine
learning and natural language processing techniques
The Framework
Online Reviews Text Corpus
The study used online product reviews as the text corpus
The online reviews were collected from the Internet
The dataset was generated by collecting a total of 2,010 hotel
reviews for 102 hotels (in 11 popular travel destinations
in India) from Tripadvisor.com and Yatra.com
Credibility of Opinion Source-Online Reviews
TripAdvisor is the largest travel community in the
world, with more than 60 million unique monthly visitors and
over 75 million reviews and opinions (comScore Media
Report, 2012). (World's most trusted travel advice)
Yatra.com is India's leading online travel website and was
recently voted Most Trusted Brand of India in the online
travel category by Brand Equity (CNBC Report, 2011)
Free text and user ratings format enables easy check of
content face validity
Each review undergoes genuine-opinion checks, and both sites
follow a zero-tolerance policy on fake reviews
Preparation of Text Corpus
The hotel reviews were downloaded from June
2012 to December 2012
The reviews were classified by overall sentiment
orientation and then divided into training and test datasets
Hotel review annotation: ratings above 3 stars were treated as
positive and ratings below 3 stars as negative
Reviews with 3 stars (neutral) were discarded to restrict the
task to binary sentiment analysis
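The annotation rule above can be sketched as follows. The function names and the three sample reviews are made up for illustration.

```python
def label_from_stars(stars):
    """Map a star rating to a binary sentiment label, or None for neutral."""
    if stars > 3:
        return "positive"
    if stars < 3:
        return "negative"
    return None  # 3-star reviews are discarded


def build_labelled_corpus(reviews):
    """Keep only (text, label) pairs for non-neutral reviews."""
    labelled = []
    for text, stars in reviews:
        label = label_from_stars(stars)
        if label is not None:
            labelled.append((text, label))
    return labelled


# Made-up reviews, for illustration only.
sample = [("Lovely stay", 5), ("Average hotel", 3), ("Dirty room", 1)]
print(build_labelled_corpus(sample))
```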
The Framework
Textual Pre-processing
The opinionated text documents were collected and then,
pre-processed to remove any non-textual information
The Vector Space Model (VSM) was adopted in order to
generate the bag of words for each document
Stemming was done to reduce words to their common
root or stem
Some stop words were removed, but we preserved
useful sentiment-expressing terms such as "ok" and
"not"
Top n-ranked terms were selected using Information Gain
feature selection
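A minimal sketch of these pre-processing steps in plain Python follows. The stop-word list is a short assumed sample, stemming is omitted for brevity, and the function names are hypothetical; Information Gain is computed here for the presence or absence of a term with respect to the class label.

```python
import math
import re
from collections import Counter

# Assumed, deliberately short stop-word list.
STOP_WORDS = {"the", "a", "an", "is", "was", "and", "of", "to", "in", "it"}
KEEP = {"ok", "not"}  # sentiment-bearing terms preserved, as described above


def bag_of_words(text):
    """Lower-case, tokenize and drop stop words (except preserved terms)."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(t for t in tokens if t in KEEP or t not in STOP_WORDS)


def entropy(counts):
    """Shannon entropy (bits) of a collection of class counts."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c)


def information_gain(term, docs):
    """IG of a term's presence/absence with respect to the class label.

    docs is a list of (bag_of_words, label) pairs.
    """
    labels = Counter(label for _, label in docs)
    with_term = Counter(label for bag, label in docs if term in bag)
    without_term = labels - with_term
    gain = entropy(labels.values())
    for subset in (with_term, without_term):
        size = sum(subset.values())
        if size:
            gain -= size / len(docs) * entropy(subset.values())
    return gain


def top_terms(docs, n):
    """Return the n terms with the highest information gain."""
    vocab = set().union(*(bag for bag, _ in docs))
    return sorted(vocab, key=lambda t: (-information_gain(t, docs), t))[:n]


# Made-up documents, for illustration only.
docs = [(bag_of_words(text), label) for text, label in [
    ("The room was clean", "positive"),
    ("Clean and nice", "positive"),
    ("The room was not clean", "negative"),
    ("Not ok", "negative"),
]]
print(top_terms(docs, 1))
```

In this toy corpus "not" separates the two classes perfectly, which shows why removing it as a stop word would hurt sentiment classification.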
The Framework
Opinion Related Resource Generation
The opinion related resource generation involves identifying
product features (attributes), extracting the associated
opinions (positive or negative) and annotating text documents
for training the machine learning classifiers
Statistical patterns like frequent nouns, adjectives and other
phrases, association rules based frequent n-grams, manual
extraction rules, sentiment and domain knowledge
dictionaries can be used for extracting features and opinion
words
Rule-based part-of-speech (POS) tagging was adopted to
identify features (as noun phrases) and opinion words
(adjectives and adverbs) in the text
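The idea of POS-based extraction can be sketched on tokens that are already tagged. The two rules below are hypothetical examples, not the study's actual extraction rules, and the simplified tag set (NN, JJ, VB, ...) and function name are assumptions.

```python
def extract_feature_opinion_tuples(tagged):
    """Return (feature, opinion) pairs from a list of (word, tag) tokens."""
    tuples = []
    for i, (word, tag) in enumerate(tagged):
        if tag != "JJ":
            continue
        # Rule 1 (hypothetical): adjective directly before a noun,
        # e.g. "cold food".
        if i + 1 < len(tagged) and tagged[i + 1][1] == "NN":
            tuples.append((tagged[i + 1][0], word))
        # Rule 2 (hypothetical): noun + verb + adjective,
        # e.g. "room was clean".
        elif i >= 2 and tagged[i - 1][1] == "VB" and tagged[i - 2][1] == "NN":
            tuples.append((tagged[i - 2][0], word))
    return tuples


# Hand-tagged, made-up sentence: "the room was clean but cold food".
sentence = [
    ("the", "DT"), ("room", "NN"), ("was", "VB"), ("clean", "JJ"),
    ("but", "CC"), ("cold", "JJ"), ("food", "NN"),
]
print(extract_feature_opinion_tuples(sentence))
```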
Example of Feature-Opinion Tuple Extraction Rules
Feature-Opinion Tuple Extraction
Redundancy pruning was done to remove non-candidate and
redundant features
Pointwise mutual information (PMI) based scores were used
to group features having similar meaning or co-occurring
features
Pointwise mutual information is a measure of association used
in information theory
Finally, phrase similarity was used to eliminate or merge
similar product features
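A minimal sketch of PMI scoring for co-occurring features follows. It assumes each review has already been reduced to a set of feature terms and estimates probabilities as fractions of reviews; the function name and sample data are made up, and the study's exact scoring may differ.

```python
import math
from collections import Counter
from itertools import combinations


def pmi_scores(feature_sets):
    """Compute PMI for every pair of features that co-occur in a review.

    feature_sets holds one set of feature terms per review.
    PMI(x, y) = log2( P(x, y) / (P(x) * P(y)) ), with probabilities
    estimated as fractions of reviews.
    """
    n = len(feature_sets)
    single = Counter()
    pair = Counter()
    for features in feature_sets:
        single.update(features)
        pair.update(combinations(sorted(features), 2))
    return {
        (x, y): math.log2((count / n) / ((single[x] / n) * (single[y] / n)))
        for (x, y), count in pair.items()
    }


# Made-up feature sets, for illustration only.
reviews = [
    {"room", "bed"},
    {"room", "bed"},
    {"food", "breakfast"},
    {"food", "room"},
]
for pair_key, score in sorted(pmi_scores(reviews).items()):
    print(pair_key, round(score, 3))
```

Pairs with a high positive score co-occur more often than chance would predict and are candidates for grouping; a negative score suggests the features are mostly discussed separately.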
The Framework
Sentiment based Classification
Feature-level sentiment analysis aims to find what
people like and dislike about a given object (Product
Feature)
Product review polarity classification involves discovering
whether a product was recommended or not recommended in a review
We applied supervised machine learning based approach
for feature-level sentiment classification of online
reviews
Support vector machine (SVM) was the machine learning
model used
The Framework
Feature-level Opinion Mining
Product features are attributes that provide functionality to
products and play a crucial role in distinguishing similar
products of different brands
Feature-level opinion mining provides deep analysis of online
reviews by identifying different features of products that
consumers are concerned about
By mining product features and their associated opinion,
feature-level buzz monitoring and feature-level opinion
summarization can be done
Buzz - a term used in word-of-mouth marketing defined as a
vague but positive (may be negative on rare occasions)
association or anticipation about a product or service
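Feature-level buzz monitoring and opinion summarization can be sketched as simple counting over feature-opinion tuples. The function name, the tuples and the two opinion word sets below are made up for illustration.

```python
from collections import Counter, defaultdict


def summarize(tuples, positive_words, negative_words):
    """Count feature mentions (buzz) and opinion polarity per feature."""
    buzz = Counter()
    summary = defaultdict(lambda: {"positive": 0, "negative": 0})
    for feature, opinion in tuples:
        buzz[feature] += 1
        if opinion in positive_words:
            summary[feature]["positive"] += 1
        elif opinion in negative_words:
            summary[feature]["negative"] += 1
    return buzz, dict(summary)


# Made-up feature-opinion tuples, for illustration only.
tuples = [
    ("room", "clean"), ("room", "small"), ("room", "nice"),
    ("food", "cold"), ("food", "tasty"),
]
buzz, summary = summarize(tuples, {"clean", "nice", "tasty"}, {"small", "cold"})
print(buzz.most_common())
print(summary)
```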
Top 100 Frequent Features Extracted from 2000
Hotel Reviews
Feature Buzz with Top 30 Features in Online Hotel
Reviews
Overall Positive and Negative Sentiment Words in
Feature-Opinion Tuple
Feature-Opinion Tuple for Top 5 Features
Room
Top 15 positive opinion words: Clean (482), Good (370), Like (238), Nice (210), Comfort (102), Better (81), Great (72), Excel (67), Big (62), Best (59), Beautiful (50), Love (45), Decent (38), Worth (28), Modern (26)
Top 15 negative opinion words: Small (139), Hot (92), Bad (82), Smell (60), Cold (52), Problem (39), Poor (37), Stink (30), Costly (28), Worst (27), Damp (27), Dark (26), Complain (18), Broken (18), Leak (15)
Food
Top 15 positive opinion words: Good (243), Excellent (76), Great (54), Tasty (47), Delight (45), Like (44), Delicious (41), Enjoy (39), Nice (37), Decent (30), Awesome (27), Best (26), Love (22), Fine (19), Better (18)
Top 15 negative opinion words: Bad (104), Worst (78), Dislike (69), Wait (62), Cold (58), Disappoint (49), Poor (42), Expensive (39), Horrible (39), Late (33), Worse (31), Smell (27), Refuse (26), Complain (21), Pathetic (17)
Stay Experience
Top 15 positive opinion words: Good (94), Comfortable (67), Nice (52), Great (44), Enjoy (30), Pleasant (30), Deluxe (27), Wonderful (20), Homestay (20), Luxury (18), Memorable (16), Relax (15), Incredible (14), Royal (12), Romantic (11)
Top 15 negative opinion words: Bad (57), Worst (43), Horrible (38), Disappoint (29), Problem (23), Poor (22), Difficult (21), Disliked (20), Nightmare (18), Costly (17), Expensive (14), Terrible (14), Pain (12), Avoid (11), Mistake (11)
Price
Top 15 positive opinion words: Good (38), Nice (28), Great (24), Reasonable (22), Recommend (20), Worth (19), Decent (17), Budget (17), Free (16), Standard (16), Fine (14), Quite (12), Ideal (11), Best (10), Ok (10)
Top 15 negative opinion words: Expensive (38), Overpriced (28), Cost (25), Waste (21), Costly (19), Fail (14), Joke (14), Limited (13), More (13), High (12), Feel (10), Cheat (9), Poorly (8), Wrong (8), Con (7)
Location
Top 15 positive opinion words: Near (52), Beautiful (35), Convenient (26), Good (22), Peace (21), Agree (19), Easily (18), Walkable (16), Wait (16), Short (14), Easy (12), Ideal (11), Lavish (10), Accessible (10), Popular (8)
Top 15 negative opinion words: Far (28), Distance (24), Away (23), Remote (21), Crowded (20), Lost (17), Problem (14), Issue (11), Out (11), Busy (10), Noisy (10), Mislead (8), Bitter (8), Hectic (7), Long (7)
Summary
The study has demonstrated methods for
automatically extracting consumer opinions from
online reviews of hotels
It has shown that aggregated consumer sentiment
as well as specific opinion about product features
can be extracted using sentiment analysis
techniques
More Advanced:
Spatiotemporal Theme Mining
Given a collection of posted articles about a topic with
time and location information
Discover multiple themes (i.e., subtopics) being discussed in
these articles
For a given location, discover how each theme evolves over
time (generate a theme life cycle)
For a given time, reveal how each theme spreads over
locations (generate a theme snapshot)
Compare theme life cycles in different locations
Compare theme snapshots in different time periods
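The last steps of this task can be sketched as plain aggregation, assuming each article already carries a time stamp, a location and a per-theme strength. How those strengths are obtained (for example from an unsupervised topic model) is not shown here; the field names, function names and sample values are made up for illustration.

```python
from collections import defaultdict


def theme_life_cycle(docs, theme, location):
    """Average strength of a theme per time point at one location."""
    totals, counts = defaultdict(float), defaultdict(int)
    for doc in docs:
        if doc["location"] == location:
            totals[doc["time"]] += doc["themes"].get(theme, 0.0)
            counts[doc["time"]] += 1
    return {t: totals[t] / counts[t] for t in sorted(totals)}


def theme_snapshot(docs, theme, time):
    """Average strength of a theme per location at one time point."""
    totals, counts = defaultdict(float), defaultdict(int)
    for doc in docs:
        if doc["time"] == time:
            totals[doc["location"]] += doc["themes"].get(theme, 0.0)
            counts[doc["location"]] += 1
    return {loc: totals[loc] / counts[loc] for loc in sorted(totals)}


# Made-up articles with assumed theme strengths, for illustration only.
docs = [
    {"time": "2005-09-07", "location": "United States", "themes": {"release": 0.75}},
    {"time": "2005-09-07", "location": "United States", "themes": {"release": 0.25}},
    {"time": "2005-09-14", "location": "United States", "themes": {"release": 0.25}},
    {"time": "2005-09-07", "location": "China", "themes": {"release": 0.5}},
]
print(theme_life_cycle(docs, "release", "United States"))
print(theme_snapshot(docs, "release", "2005-09-07"))
```

Comparing life cycles across locations, or snapshots across time periods, then amounts to calling these functions with different arguments.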
Challenges in
Spatiotemporal Theme Mining
How to represent a theme?
How to model the themes in a collection?
How to model their dependency on time and
location?
How to compute the theme life cycles and
theme snapshots?
All these must be done in an unsupervised
way
How?
Time-stamped data sets of weblogs, each about one
event (broad topic):
Data Set  | # docs | Time Span (2005) | Query
Katrina   | 9377   | 08/16 - 10/04    | Hurricane Katrina
Rita      | 1754   | 08/16 - 10/04    | Hurricane Rita
iPod Nano | 1720   | 09/02 - 10/26    | iPod Nano
Extract location information from author profiles
Isolate by location
On each data set, we extract a set of salient themes
and their life cycles / theme snapshots
Theme Life Cycles for iPod Nano
United States
China
Release of Nano
Canada
United Kingdom
ipod 0.2875
nano 0.1646
apple 0.0813
september 0.0510
mini 0.0442
screen 0.0242
new 0.0200
Applications / Caselets
Identifying the Target Segment
CASE STUDY | Enabling your passionates
COMPANY BACKGROUND
Maker of pruning shears
for gardening and scissors
for crafts
NEED
Wanted to build a
marketing campaign to
recruit brand advocates
into an online
community
ASSUMPTIONS
Knew Boomer Females
were great target for
sewing and crafts
Surprising findings
SOLUTION
Baseline read for
online chatter
Identify
demographics
FINDINGS
Found that Gen Y
females were
actually the right
target
AND, big issue
was online
crafters could be
mean
Adjusting the game plan
RESULTS
Adjusted strategy for
new demographics
and new voices
Created ambassador
program which has
helped grow
Fiskateers to more
than 6,000 active
members
Members invite others
In first 3 months,
increased online
mentions by 341%
Sales grew by 20%
Applications / Caselets
Trend and Segmentation Analysis
Are Consumers Buying Green?
[Chart: monthly trend across 2007 and 2008; labeled value: 160%]
Trend analysis
[Chart values: 156,177; 98,148; 71,882; 51,638; 37,944]
Early 2007 was dominated by the Negators and the
"I just don't know what to think" crowd
[Segment chart; axes: ACTION vs. INACTION, AGREEMENT vs. DISAGREEMENT]
Negator: 22%
Social Activist: 9%
Personal Shifter: 8%
Rejecter: 14%
Uncertain: 24%
Idler: 5%
Skeptic: 12%
Guilty: 6%
Apathetic: (not measured)
By late 2007, momentum had swung to
agreement
[Segment chart; axes: ACTION vs. INACTION, AGREEMENT vs. DISAGREEMENT]
Negator: 17%
Social Activist: 10%
Personal Shifter: 16%
Rejecter: 12%
Uncertain: 9%
Idler: 13%
Skeptic: 11%
Guilty: 14%
Apathetic: (not measured)
Concern about the environment continued
to gain momentum in early 2008
[Segment chart; axes: ACTION vs. INACTION, AGREEMENT vs. DISAGREEMENT]
Negator: 14%
Social Activist: 8%
Personal Shifter: 19%
Rejecter: 8%
Uncertain: 10%
Idler: 15%
Skeptic: 13%
Guilty: 13%
Apathetic: (not measured)
By 2010, more than 7 out of 10 were concerned, and
almost half were actively doing something about it
[Segment chart; axes: ACTION vs. INACTION, AGREEMENT vs. DISAGREEMENT]
Negator: 3%
Social Activist: 18%
Personal Shifter: 27%
Rejecter: 5%
Uncertain: 10%
Skeptic: 10%
Idler: 21%
Guilty: 6%
Apathetic: (not measured)