Introduction to Text Mining
Hongning Wang
CS@UVa
Who Am I?
• Hongning Wang
– Assistant professor in CS@UVa since August 2014
– Research areas
• Information retrieval
• Data mining
• Machine learning
CS@UVa CS6501: Text Mining 2
What Am I Doing at UVa?
• Sentiment analysis with topic modeling
What Am I Doing at UVa?
• Interactive online recommendation
– Modeling recommendation as a two-party game
– Goal: [figure]; Strategy?
– Challenges:
• 1) Unknown preference;
• 2) Feedback is acquired on the fly, and it is not free!
What Am I Doing at UVa?
• Yahoo frontpage news recommendation
18,882 users, 188,384 articles, and 9,984,879 logged
events segmented into 1,123,583 sessions.
What Am I Doing at UVa?
• Personalization techniques raise serious public
concerns about privacy infringement
No means for users to opt out of data collection!
What Am I Doing at UVa?
• Privacy-preserving personalization
Stronger privacy guarantee than k-anonymity
What about you?
• Why did you choose this course?
• Anything specific you want me to know?
• What type of text data do you often
encounter in your projects?
• What kind of knowledge do you want to
extract from it?
What is “Text Mining”?
• “Text mining, also referred to as text data
mining, roughly equivalent to text analytics,
refers to the process of deriving high-quality
information from text.” - Wikipedia
• “Another way to view text data mining is as a
process of exploratory data analysis that
leads to heretofore unknown information, or
to answers for questions for which the answer
is not currently known.” - Hearst, 1999
Two different definitions of mining
• Goal-oriented (effectiveness driven)
– Any process that generates useful results that are
non-obvious is called “mining”
– Keywords: “useful” + “non-obvious”
– Data isn’t necessarily massive
• Method-oriented (efficiency driven)
– Any process that involves extracting information from
massive data is called “mining”
– Keywords: “massive” + “pattern”
– Patterns aren’t necessarily useful
Knowledge discovery from text data
• IBM’s Watson wins at Jeopardy! - 2011
An overview of Watson
What is inside Watson?
• “Watson had access to 200 million pages of
structured and unstructured content consuming
four terabytes of disk storage including the full
text of Wikipedia” – PC World
• “The sources of information for Watson include
encyclopedias, dictionaries, thesauri, newswire
articles, and literary works. Watson also used
databases, taxonomies, and ontologies.
Specifically, DBPedia, WordNet, and Yago were
used.” – AI Magazine
What is inside Watson?
• DeepQA system
– “Watson's main innovation was not in the creation
of a new algorithm for this operation but rather its
ability to quickly execute hundreds of proven
language analysis algorithms simultaneously to
find the correct answer.” – New York Times
– The DeepQA Research Team
Text mining around us
• Sentiment analysis
Text mining around us
• Document summarization
Text mining around us
• Movie recommendation
Text mining around us
• Restaurant/hotel recommendation
Text mining around us
• News recommendation
Text mining around us
• Text analytics in financial services
Text mining around us
• Text analytics in healthcare
How to perform text mining?
• As computer scientists, we view it as
– Text Mining = Data Mining + Text Data
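One way to read this equation: once text is turned into numeric features, standard data mining methods apply. The sketch below is a minimal illustration, assuming a simple bag-of-words representation; the helper names `bag_of_words` and `vectorize` and the two sample documents are made up for this example.

```python
from collections import Counter
import re

def bag_of_words(text):
    """Lowercase the text and count its alphabetic word tokens."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def vectorize(docs):
    """Map raw documents to count vectors over a shared, sorted vocabulary."""
    bags = [bag_of_words(d) for d in docs]
    vocab = sorted(set().union(*bags))
    # A Counter returns 0 for words a document does not contain.
    return vocab, [[bag[w] for w in vocab] for bag in bags]

# Hypothetical sample documents
docs = ["Text mining is fun", "Data mining finds patterns in data"]
vocab, vectors = vectorize(docs)
```

Each row of `vectors` is now an ordinary feature vector that a classifier or clustering algorithm can consume.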
Text mining vs. NLP, IR, DM…
• How does it relate to data mining in general?
• How does it relate to computational
linguistics?
• How does it relate to information retrieval?
                  | Finding Patterns          | Finding “Nuggets”: Novel  | Finding “Nuggets”: Non-Novel
Non-textual data  | General data-mining       | Exploratory data analysis | Database queries
Textual data      | Computational Linguistics | Text Mining               | Information retrieval
Text mining in general
• Access: filter information
– Serve for IR applications
• Mining: discover knowledge
– Sub-area of DM research
• Organization: add structure/annotations
– Based on NLP/ML techniques
Challenges in text mining
• Data collection is “free text”
– Data is not well-organized
• Semi-structured or unstructured
– Natural language text contains ambiguities on many levels
• Lexical, syntactic, semantic, and pragmatic
– Learning techniques for processing text typically need
annotated training examples
• Expensive to acquire at scale
• What to mine?
Text mining problems we will solve
• Lexical semantics and word senses
– Identifying which sense of a word (i.e. meaning) is
used in a sentence, when the word has multiple
meanings
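As a rough illustration of word sense identification, here is a sketch of a simplified Lesk-style heuristic: pick the sense whose definition shares the most words with the sentence. This is only one classic baseline, not necessarily the method used in the course. The sense inventory `SENSES`, the stopword list, and the function names are hypothetical; a real system would draw glosses from a resource such as WordNet.

```python
import re

# Hypothetical toy sense inventory (glosses written for this example only).
SENSES = {
    "bank": {
        "financial institution": "an institution that accepts deposits and lends money",
        "river side": "the sloping land beside a river or lake",
    }
}
# Small illustrative stopword list, not a standard one.
STOPWORDS = {"a", "an", "the", "of", "and", "or", "that", "to", "in", "on", "at", "by", "we", "i"}

def content_words(text):
    """Return the set of lowercase word tokens minus stopwords."""
    return set(re.findall(r"[a-z]+", text.lower())) - STOPWORDS

def simplified_lesk(word, sentence):
    """Choose the sense whose gloss overlaps most with the sentence context."""
    context = content_words(sentence)
    best_sense, best_overlap = None, -1
    for sense, gloss in SENSES[word].items():
        overlap = len(context & content_words(gloss))
        if overlap > best_overlap:  # ties keep the earlier sense
            best_sense, best_overlap = sense, overlap
    return best_sense
```

For example, `simplified_lesk("bank", "We sat on the bank of the river")` selects the "river side" sense because the gloss and the sentence share the word "river".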
Text mining problems we will solve
• Document categorization
– Adding structures to the text corpus
Text mining problems we will solve
• Text clustering
– Identifying structures in the text corpus
Text mining problems we will solve
• Topic modeling
– Identifying structures in the text corpus
Text mining problems we will solve
• Social media and social network analysis
– Exploring additional structure in the text corpus
We will also briefly cover
• Natural language processing pipeline
– Tokenization
• “Studying text mining is fun!” -> “studying” + “text” +
“mining” + “is” + “fun” + “!”
– Part-of-speech tagging
• “Studying text mining is fun!” ->
– Dependency parsing
• “Studying text mining is fun!” ->
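The tokenization step above can be sketched with a single regular expression. This is a minimal illustration that assumes lowercasing and treats each punctuation mark as its own token; the function name `tokenize` is made up here. Part-of-speech tagging and dependency parsing normally rely on trained models and are not shown.

```python
import re

def tokenize(text):
    """Split text into lowercase word tokens and single punctuation marks."""
    # \w+ matches a run of letters, digits, or underscores;
    # [^\w\s] matches one character that is neither a word character nor whitespace.
    return re.findall(r"\w+|[^\w\s]", text.lower())

tokens = tokenize("Studying text mining is fun!")
# tokens == ['studying', 'text', 'mining', 'is', 'fun', '!']
```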
We will also briefly cover
• Machine learning techniques
– Supervised methods
• Naïve Bayes, k Nearest Neighbors, Logistic Regression
– Unsupervised methods
• K-Means, hierarchical clustering, topic models
– Semi-supervised methods
• Expectation Maximization
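To make the supervised case concrete, here is a minimal sketch of a multinomial Naïve Bayes text classifier with add-one (Laplace) smoothing. The training sentences, labels, and function names are invented for illustration; it assumes a bag-of-words model and ignores words never seen in training.

```python
import math
import re
from collections import Counter, defaultdict

def tokenize(text):
    """Lowercase the text and extract alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())

def train_naive_bayes(labeled_docs):
    """Collect per-class document counts, per-class word counts, and the vocabulary."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in labeled_docs:
        class_counts[label] += 1
        for w in tokenize(text):
            word_counts[label][w] += 1
            vocab.add(w)
    return class_counts, word_counts, vocab

def predict(model, text):
    """Return the class with the highest log posterior score."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    scores = {}
    for label, n_docs in class_counts.items():
        total_words = sum(word_counts[label].values())
        score = math.log(n_docs / total_docs)  # log prior
        for w in tokenize(text):
            if w in vocab:  # skip words never seen in training
                # Add-one smoothed log likelihood of the word given the class
                score += math.log((word_counts[label][w] + 1) / (total_words + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)

# Hypothetical toy training set
train = [
    ("great movie, loved it", "pos"),
    ("wonderful and fun", "pos"),
    ("terrible plot, boring movie", "neg"),
    ("boring and awful", "neg"),
]
model = train_naive_bayes(train)
```

Calling `predict(model, "boring and terrible")` scores the "neg" class higher, since "boring" and "terrible" occur only in negative training examples.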
Text mining in the era of Big Data
• Huge in size
– Google processes 5.13B queries/day (2013)
– Twitter receives 340M tweets/day (2012)
– Facebook has 2.5 PB of user data + 15 TB/day (4/2009)
– eBay has 6.5 PB of user data + 50 TB/day (5/2009)
• 80% of data is unstructured (IBM, 2010)
“640K ought to be enough for anybody.”
Scalability is crucial
• Large scale text processing techniques
– MapReduce framework
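The MapReduce idea can be illustrated with word counting on a single machine. The sketch below only simulates the map, shuffle, and reduce phases in plain Python; it is not the API of Hadoop or any other framework, and all function names are made up for this example.

```python
from collections import defaultdict

def map_phase(doc):
    """Map: emit a (word, 1) pair for every whitespace-separated word."""
    for word in doc.lower().split():
        yield word, 1

def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: sum the counts collected for one word."""
    return key, sum(values)

def word_count(docs):
    """Run the three phases in sequence over a list of documents."""
    pairs = (pair for doc in docs for pair in map_phase(doc))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())

counts = word_count(["text mining is fun", "mining text data"])
```

In a real cluster, the map and reduce calls run in parallel on different machines and the framework performs the shuffle; the per-key independence of `reduce_phase` is what makes that possible.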
State-of-the-art solutions
• Apache Spark (spark.apache.org)
– In-memory MapReduce
• Specialized for machine learning algorithms
– Speed
• Reported to run up to 100x faster than Hadoop MapReduce
in memory, or 10x faster on disk
– Generality
• Combine SQL, streaming, and complex analytics
State-of-the-art solutions
• GraphLab (graphlab.com)
– Graph-based, high performance, distributed
computation framework
– Specialized for sparse data with local
dependencies for iterative algorithms
Text mining in the era of Big Data
• Human: big data producer and consumer
– As data producer: human-generated data (text data, behavior data)
• Challenges: 1. Unstructured data; 2. Rich semantic
– As knowledge consumer: knowledge service system
• Challenges: 1. Implicit feedback; 2. Diverse and dynamic
– Knowledge discovery
Textbooks
• Introduction to Information Retrieval.
Christopher D. Manning, Prabhakar Raghavan,
and Hinrich Schuetze, Cambridge University
Press, 2007.
• Speech and Language Processing. Daniel
Jurafsky and James H. Martin, Pearson Education,
2000.
• Mining Text Data. Charu C. Aggarwal and
ChengXiang Zhai, Springer, 2012.
What to read?
• Text Mining draws on several neighboring areas:
– Algorithms: Machine Learning, Pattern Recognition, Statistics, Optimization
• ICML, NIPS, UAI
– Data Mining
• KDD, ICDM, SDM
– NLP
• ACL, EMNLP, COLING
– Information Retrieval
• SIGIR, WWW, WSDM, CIKM
– Applications: Web Applications, Bioinformatics…
– Library & Info Science
• Find more resources on the course website
Welcome to the class of “Text Mining”!