Introduction to Text Mining

Hongning Wang
CS@UVa
Who Am I?
• Hongning Wang
– Assistant professor in CS@UVa since August 2014
– Research areas
• Information retrieval
• Data mining
• Machine learning

CS@UVa CS6501: Text Mining 2

What Am I Doing at UVa?
• Sentiment analysis with topic modeling

What Am I Doing at UVa?
• Interactive online recommendation
– Modeling recommendation as a two-party game
[Figure labels: Strategy? Goal:]
– Challenges:
1) Unknown preference;
2) Feedback is acquired on the fly, and it is not free!

What Am I Doing at UVa?
• Yahoo frontpage news recommendation

– 18,882 users, 188,384 articles, and 9,984,879 logged events segmented into 1,123,583 sessions.

What Am I Doing at UVa?
• Personalization techniques raise serious public
concerns about privacy infringement
– No means for users to opt out of data collection!

What Am I Doing at UVa?
• Privacy-preserving personalization
– Stronger privacy guarantee than k-anonymity

What about you?
• Why did you choose this course?
• Anything specific you want me to know?
• What type of text data do you often
encounter in your projects?
• What kind of knowledge do you want to
extract from it?

What is “Text Mining”?
• “Text mining, also referred to as text data
mining, roughly equivalent to text analytics,
refers to the process of deriving high-quality
information from text.” - Wikipedia
• “Another way to view text data mining is as a
process of exploratory data analysis that
leads to heretofore unknown information, or
to answers for questions for which the answer
is not currently known.” - Hearst, 1999

Two different definitions of mining
• Goal-oriented (effectiveness driven)
– Any process that generates useful results that are non-
obvious is called “mining”.
– Keywords: “useful” + “non-obvious”
– Data isn’t necessarily massive
• Method-oriented (efficiency driven)
– Any process that involves extracting information from
massive data is called “mining”
– Keywords: “massive” + “pattern”
– Patterns aren’t necessarily useful

Knowledge discovery from text data
• IBM’s Watson wins at Jeopardy! - 2011

An overview of Watson

What is inside Watson?
• “Watson had access to 200 million pages of
structured and unstructured content consuming
four terabytes of disk storage including the full
text of Wikipedia” – PC World
• “The sources of information for Watson include
encyclopedias, dictionaries, thesauri, newswire
articles, and literary works. Watson also used
databases, taxonomies, and ontologies.
Specifically, DBPedia, WordNet, and Yago were
used.” – AI Magazine

What is inside Watson?
• DeepQA system
– “Watson's main innovation was not in the creation
of a new algorithm for this operation but rather its
ability to quickly execute hundreds of proven
language analysis algorithms simultaneously to
find the correct answer.” – New York Times
– The DeepQA Research Team

Text mining around us
• Sentiment analysis

Text mining around us
• Document summarization

Text mining around us
• Movie recommendation

Text mining around us
• Restaurant/hotel recommendation

Text mining around us
• News recommendation

Text mining around us
• Text analytics in financial services

Text mining around us
• Text analytics in healthcare

How to perform text mining?
• As computer scientists, we view it as
– Text Mining = Data Mining + Text Data

Text mining vs. NLP, IR, DM…
• How does it relate to data mining in general?
• How does it relate to computational
linguistics?
• How does it relate to information retrieval?
                   Finding Patterns            Finding “Nuggets”
                                               Novel                       Non-Novel
Non-textual data   General data-mining         Exploratory data analysis   Database queries
Textual data       Computational Linguistics   Text Mining                 Information retrieval

Text mining in general
• Access: filter information
– Serve for IR applications
• Mining: discover knowledge
– Sub-area of DM research
• Organization: add structure/annotations
– Based on NLP/ML techniques

Challenges in text mining
• Data collection is “free text”
– Data is not well-organized
• Semi-structured or unstructured
– Natural language text contains ambiguities on many levels
• Lexical, syntactic, semantic, and pragmatic
– Learning techniques for processing text typically need
annotated training examples
• Expensive to acquire at scale
• What to mine?

Text mining problems we will solve
• Lexical semantics and word senses
– Identifying which sense of a word (i.e. meaning) is
used in a sentence, when the word has multiple
meanings
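
As a rough illustration of word sense identification, the sketch below uses a simplified Lesk-style overlap: it picks the sense whose dictionary gloss shares the most words with the sentence. This is one classic baseline, not necessarily the method taught in the course. The sense labels and glosses are made up for the example, and the function name `simplified_lesk` is hypothetical.

```python
def simplified_lesk(context_tokens, sense_glosses):
    """Pick the sense whose gloss shares the most words with the context."""
    context = set(context_tokens)
    best_sense, best_overlap = None, -1
    for sense, gloss in sense_glosses.items():
        # Count distinct words that appear in both the sentence and the gloss.
        overlap = len(context & set(gloss.lower().split()))
        if overlap > best_overlap:
            best_sense, best_overlap = sense, overlap
    return best_sense

# Hypothetical glosses for two senses of "bank" (illustration only).
glosses = {
    "bank/finance": "an institution that accepts deposits and lends money",
    "bank/river": "sloping land beside a body of water such as a river",
}

sentence = "he sat on the bank of the river and watched the water".split()
print(simplified_lesk(sentence, glosses))
```

Raw word overlap is brittle (it counts function words such as "and" and ignores word forms), which is part of why annotated data and learned models are usually preferred in practice.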

Text mining problems we will solve
• Document categorization
– Adding structures to the text corpus

Text mining problems we will solve
• Text clustering
– Identifying structures in the text corpus

Text mining problems we will solve
• Topic modeling
– Identifying structures in the text corpus

Text mining problems we will solve
• Social media and social network analysis
– Exploring additional structure in the text corpus

We will also briefly cover
• Natural language processing pipeline
– Tokenization
• “Studying text mining is fun!” -> “studying” + “text” +
“mining” + “is” + “fun” + “!”
– Part-of-speech tagging
• “Studying text mining is fun!” ->
– Dependency parsing
• “Studying text mining is fun!” ->
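
The tokenization step in the pipeline above can be sketched with a single regular expression. This is a minimal example that assumes lowercasing and treats each punctuation mark as its own token, matching the slide's example; real tokenizers handle many more cases (contractions, URLs, numbers). The function name `tokenize` is chosen for illustration.

```python
import re

def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into word and punctuation tokens."""
    # \w+ matches a run of letters/digits; [^\w\s] matches one punctuation mark.
    return re.findall(r"\w+|[^\w\s]", text.lower())

tokens = tokenize("Studying text mining is fun!")
print(tokens)  # ['studying', 'text', 'mining', 'is', 'fun', '!']
```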

We will also briefly cover
• Machine learning techniques
– Supervised methods
• Naïve Bayes, k Nearest Neighbors, Logistic Regression
– Unsupervised methods
• K-Means, hierarchical clustering, topic models
– Semi-supervised methods
• Expectation Maximization
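
To make the supervised case concrete, here is a small sketch of a multinomial Naïve Bayes text classifier with add-one smoothing, written from scratch. The toy training documents, the labels "pos" and "neg", and the function names `train_nb` and `predict_nb` are assumptions for illustration; they are not taken from the course materials.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs):
    """docs: list of (tokens, label). Returns class counts, word counts, vocabulary."""
    class_docs = Counter()              # number of documents per class
    word_counts = defaultdict(Counter)  # word frequencies per class
    vocab = set()
    for tokens, label in docs:
        class_docs[label] += 1
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_docs, word_counts, vocab

def predict_nb(model, tokens):
    """Return the class with the highest log posterior for the given tokens."""
    class_docs, word_counts, vocab = model
    total_docs = sum(class_docs.values())
    best_label, best_score = None, -math.inf
    for label in class_docs:
        score = math.log(class_docs[label] / total_docs)  # log prior
        total_words = sum(word_counts[label].values())
        for tok in tokens:
            if tok not in vocab:
                continue  # ignore words never seen in training
            # Add-one (Laplace) smoothed log likelihood of the word given the class.
            score += math.log((word_counts[label][tok] + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Toy training set (made up for this example).
training = [
    ("great fun movie".split(), "pos"),
    ("fun and great acting".split(), "pos"),
    ("boring slow movie".split(), "neg"),
    ("slow and boring plot".split(), "neg"),
]
model = train_nb(training)
print(predict_nb(model, "great fun plot".split()))
```

Working in log space avoids numeric underflow when many small probabilities are multiplied, and the smoothing keeps a single unseen word-class pair from zeroing out a class.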

Text mining in the era of Big Data
• Huge in size
– Google processes 5.13B queries/day (2013)
– Twitter receives 340M tweets/day (2012)
– Facebook has 2.5 PB of user data + 15 TB/day (4/2009)
– eBay has 6.5 PB of user data + 50 TB/day (5/2009)
• 80% of data is unstructured (IBM, 2010)
“640K ought to be enough for anybody.”

Scalability is crucial
• Large scale text processing techniques
– MapReduce framework
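
The MapReduce programming model can be sketched on a single machine with word counting, the usual introductory example. The code below only imitates the three stages (map, shuffle, reduce) in plain Python; it is not Hadoop or any real framework's API, and the function names are chosen for illustration.

```python
from collections import defaultdict
from itertools import chain

def map_phase(doc):
    """Mapper: emit a (word, 1) pair for every word in one document."""
    return [(word, 1) for word in doc.lower().split()]

def shuffle(pairs):
    """Group emitted values by key, as a framework would between the two phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reducer: sum the counts collected for one word."""
    return key, sum(values)

docs = ["text mining is fun", "mining text data", "data is big"]
mapped = chain.from_iterable(map_phase(d) for d in docs)
counts = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
print(counts)
```

Because each mapper call sees one document and each reducer call sees one key, both stages can be spread across many machines, which is what makes the model attractive for large text collections.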

State-of-the-art solutions
• Apache Spark (spark.apache.org)
– In-memory MapReduce
• Specialized for machine learning algorithms
– Speed
• Up to 100x faster than Hadoop MapReduce in memory, or
10x faster on disk (as reported by the Spark project).

State-of-the-art solutions
• Apache Spark (spark.apache.org)
– In-memory MapReduce
• Specialized for machine learning algorithms
– Generality
• Combine SQL, streaming, and complex analytics

State-of-the-art solutions
• GraphLab (graphlab.com)
– Graph-based, high performance, distributed
computation framework

State-of-the-art solutions
• GraphLab (graphlab.com)
– Specialized for sparse data with local
dependencies for iterative algorithms

Text mining in the era of Big Data
• Human: big data producer and consumer
– Human-generated data: text data and behavior data
– Knowledge discovery, knowledge service system
• As knowledge consumer
– Challenges:
1. Implicit feedback
2. Diverse and dynamic
• As data producer
– Challenges:
1. Unstructured data
2. Rich semantic

Textbooks
• Introduction to Information Retrieval.
Christopher D. Manning, Prabhakar Raghavan,
and Hinrich Schütze, Cambridge University
Press, 2007.
• Speech and Language Processing. Daniel
Jurafsky and James H. Martin, Pearson Education,
2000.
• Mining Text Data. Charu C. Aggarwal and
ChengXiang Zhai, Springer, 2012.

What to read?
• Related areas and venues around Text Mining
– Algorithms: Machine Learning, Pattern Recognition, Statistics, Optimization (ICML, NIPS, UAI)
– Data Mining (KDD, ICDM, SDM)
– NLP (ACL, EMNLP, COLING)
– Information Retrieval (SIGIR, WWW, WSDM, CIKM)
– Applications: Web Applications, Bioinformatics…
– Library & Info Science
• Find more resources on the course website

Welcome to the class of “Text Mining”!
