LECTURE 1: INTRODUCTION TO DATA MINING
Dr. Dhaval Patel
CSE, IIT-Roorkee
What is data mining?
Data mining is also called knowledge discovery in databases (KDD)
Data mining is the extraction of useful patterns from data sources, e.g., databases, texts, web, images.
Patterns must be:
valid, novel, potentially useful, understandable
Knowledge Discovery in Data: Process
[Figure: Data -> Data Mining -> Patterns -> Interpretation/Evaluation -> Knowledge]
Knowledge Discovery in Data: Challenges
Data:
Volume
- Big Data
- Small Data
Variety
- Transaction
- Temporal
- Spatial
- …
Velocity
- Data Stream
- Static
Outline (Part 1)
Introduction to Data
- Transactional Data
- Temporal Data
- Spatial & Spatial-Temporal Data
Data Preprocessing
- Missing Values
- Summarization
INTRODUCTION TO DATA
Data Come from Everywhere
Grocery Markets, E-Commerce, Stock Exchange, Hospital, Weather Station, Social Media
But they come in different forms
What is Data?
Collection of records and their attributes
An attribute is a characteristic of an object
A collection of attributes describes an object

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes
(Columns are attributes; rows are objects.)
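A minimal sketch of the same idea in code: each record (object) is a dictionary and each key is an attribute. The field names and the income-in-thousands encoding are illustrative assumptions, and only the first three rows of the table are shown.

```python
# Each record (object) is a dict; each key is an attribute.
# Field names and the "income in thousands" encoding are assumptions.
records = [
    {"tid": 1, "refund": "Yes", "marital_status": "Single", "taxable_income_k": 125, "cheat": "No"},
    {"tid": 2, "refund": "No", "marital_status": "Married", "taxable_income_k": 100, "cheat": "No"},
    {"tid": 3, "refund": "No", "marital_status": "Single", "taxable_income_k": 70, "cheat": "No"},
]

# The attributes are the keys shared by every record.
attributes = list(records[0].keys())
print(attributes)

# Select the objects whose attribute value satisfies a condition.
singles = [r["tid"] for r in records if r["marital_status"] == "Single"]
print(singles)  # [1, 3]
```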
Types of Data
Record Data
- Transactional Data
Temporal Data
- Time Series Data
- Sequence Data
Spatial & Spatial-Temporal Data
- Spatial Data
- Spatial-Temporal Data
Graph Data
- Transactional Data
Unstructured Data
- Twitter Status Message
- Review, news article
Semi-Structured Data
- Paper Publications Data
- XML format
Record Data
• Transaction Data
TID Items
1 Bread, Coke, Milk
2 Beer, Bread
3 Beer, Coke, Diaper, Milk
4 Beer, Bread, Diaper, Milk
5 Coke, Diaper, Milk
Market-Basket Dataset
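A small sketch of how the market-basket table might be held in memory: each transaction is a set of items keyed by its TID. The variable names are illustrative, not from the slides.

```python
from collections import Counter

# The five market-basket transactions from the table, keyed by TID.
transactions = {
    1: {"Bread", "Coke", "Milk"},
    2: {"Beer", "Bread"},
    3: {"Beer", "Coke", "Diaper", "Milk"},
    4: {"Beer", "Bread", "Diaper", "Milk"},
    5: {"Coke", "Diaper", "Milk"},
}

# Count in how many transactions each item appears.
item_counts = Counter(item for items in transactions.values() for item in items)
print(item_counts["Milk"])  # 4
```

Using sets makes "does this transaction contain these items?" a simple subset test.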
Data Matrix
If data objects have the same fixed set of numeric attributes, then the data objects can be thought of as points in a multi-dimensional space, where each dimension represents a distinct attribute
Such a data set can be represented by an m by n matrix, where there are m rows, one for each object, and n columns, one for each attribute
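As a rough illustration, an m by n data matrix can be written as a list of rows. The attribute names and values below are invented for this sketch.

```python
# A data matrix as a list of rows: m objects (rows) by n numeric attributes (columns).
# The attribute names and values are made up for illustration.
attribute_names = ["height_cm", "weight_kg", "age_years"]
data_matrix = [
    [170.0, 65.0, 30.0],
    [182.0, 80.0, 41.0],
    [158.0, 52.0, 25.0],
    [175.0, 72.0, 36.0],
]

m = len(data_matrix)      # number of objects
n = len(data_matrix[0])   # number of attributes
print(m, n)  # 4 3

# Each row is a point in n-dimensional space; each column holds one attribute.
weights = [row[attribute_names.index("weight_kg")] for row in data_matrix]
print(weights)  # [65.0, 80.0, 52.0, 72.0]
```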
Data Matrix Example for Documents
Each document becomes a 'term' vector: each term is a component (attribute) of the vector, and the value of each component is the number of times the corresponding term occurs in the document.
[Table: document-term matrix; term columns: timeout, season, coach, game, score, team, ball, lost, play, win]
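A minimal sketch of building term vectors by counting words. The two toy documents are invented; the vocabulary reuses the ten terms from the slide's table.

```python
from collections import Counter

# Two toy documents; the text is invented for illustration.
documents = [
    "team coach play ball score game win team",
    "season timeout lost game game score",
]

# Vocabulary: one column per term.
vocabulary = ["team", "coach", "play", "ball", "score",
              "game", "win", "lost", "timeout", "season"]

def term_vector(text, vocab):
    """Return the count of each vocabulary term in the text."""
    counts = Counter(text.split())
    return [counts[term] for term in vocab]

# One row per document, one column per term.
doc_term_matrix = [term_vector(doc, vocabulary) for doc in documents]
for row in doc_term_matrix:
    print(row)
```

Real text would need tokenization and normalization first; splitting on whitespace is a simplifying assumption.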
Distance Matrix
Points:
point  x  y
p1     0  2
p2     2  0
p3     3  1
p4     5  1
[Figure: scatter plot of p1 to p4 on x-y axes]
Distance matrix:
      p1     p2     p3     p4
p1    0      2.828  3.162  5.099
p2    2.828  0      1.414  3.162
p3    3.162  1.414  0      2
p4    5.099  3.162  2      0
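The distance values can be reproduced from the point coordinates. This sketch assumes Euclidean distance, which matches the numbers in the matrix.

```python
import math

# Coordinates of the four points from the table.
points = {"p1": (0, 2), "p2": (2, 0), "p3": (3, 1), "p4": (5, 1)}

def euclidean(a, b):
    """Straight-line distance between two 2-D points."""
    return math.hypot(a[0] - b[0], a[1] - b[1])

names = list(points)
# Pairwise distances, rounded to three decimals as on the slide.
distance_matrix = [
    [round(euclidean(points[r], points[c]), 3) for c in names]
    for r in names
]
for name, row in zip(names, distance_matrix):
    print(name, row)
```

The matrix is symmetric with zeros on the diagonal, so only the upper triangle carries information.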
Temporal Data
Sequence Data
(Patient Data obtained from Zhang’s KDD 06 Paper)
Temporal Data
Time Series Data
Yahoo Finance Website
Biological Sequence Data
Interval Data
EL = { (A, 1, 5), (C, 3, 12), (B, 4, 9), (D, 9, 15) }
( ( (A overlaps C) contains B ) overlaps D )
[Figure: the intervals of EL drawn on a time axis with ticks at 1, 3, 4, 5, 9, 12, 15]
(Interval Patient Data obtained from Amit’s M.Tech. Thesis Work)
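A rough sketch of how the nested relation can be derived from EL. It assumes that two related intervals are treated as one composite interval spanning both, and it implements only the few interval relations needed here; the function names are hypothetical.

```python
# Each event interval is (label, start, end), as in EL.
EL = [("A", 1, 5), ("C", 3, 12), ("B", 4, 9), ("D", 9, 15)]

def relation(x, y):
    """Name the relation between intervals x and y (only the cases needed here)."""
    (_, xs, xe), (_, ys, ye) = x, y
    if xs < ys and ys < xe < ye:
        return "overlaps"
    if xs <= ys and ye <= xe:
        return "contains"
    if xe <= ys:
        return "before"
    return "other"

def merge(x, y):
    """Treat two related intervals as one composite interval spanning both (an assumption)."""
    return (f"({x[0]} {relation(x, y)} {y[0]})", min(x[1], y[1]), max(x[2], y[2]))

intervals = {label: (label, s, e) for label, s, e in EL}
ac = merge(intervals["A"], intervals["C"])
acb = merge(ac, intervals["B"])
acbd = merge(acb, intervals["D"])
print(acbd[0])  # (((A overlaps C) contains B) overlaps D)
```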
Spatial & Spatial-Temporal Data
• Spatial Data
(Spatial Distribution of Objects of Various Types : Prof. Shashi Shekhar)
Spatial & Spatial-Temporal Data
Spatial Data
Average Monthly Temperature of land and ocean
Spatial & Spatial-Temporal Data
Spatial Data
Dengue Disease Dataset (Singapore)
Spatial & Spatial-Temporal Data
Trajectory Data: Set of Hurricanes
http://csc.noaa.gov/hurricanes
Spatial & Spatial-Temporal Data
Trajectory Data (of 87 users, obtained using RFID)
VAST 2008 Challenge – RFID Dataset
User Movement Data
Trajectory
Movement trail of a user
Sampling Points: <latitude, longitude, time>
[Figure: trajectory P1 on weekends, with places labeled Home, Stadium, Movie Complex, Swimming Pool]
Thanks to Shreyash and Sahoishnu (M.Tech. Students)
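A minimal sketch of a trajectory as a time-ordered list of sampling points <latitude, longitude, time>. The class name, the time unit, and all coordinate values are invented for illustration.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SamplingPoint:
    latitude: float
    longitude: float
    time: int  # seconds since the start of the recording (assumed unit)

# A trajectory is the time-ordered list of one user's sampling points.
# Coordinates and timestamps below are invented for illustration.
raw_points = [
    SamplingPoint(29.8650, 77.8960, 600),
    SamplingPoint(29.8543, 77.8880, 0),
    SamplingPoint(29.8700, 77.9020, 1500),
]
trajectory = sorted(raw_points, key=lambda p: p.time)

# Elapsed time between the first and last sampling point.
duration = trajectory[-1].time - trajectory[0].time
print(len(trajectory), duration)  # 3 1500
```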
Graph Data
Semi-structured Data
Unstructured Data
Data can help us solve specific problems.
How should these pictures be placed into 3 groups?
How should these pictures be placed into groups?
How many groups should there be?
Which genes are associated with a disease? How can expression values be used to predict survival?
What items should Amazon display for me?
Is it likely that this stock was traded based on illegal insider information?
Where are the faces in this picture?
Is this spam?
Will I like 300?
What techniques do people apply to data?
They apply data mining algorithms and discover useful knowledge
So, what are some of the well-known data mining tasks?
Clustering,
Classification,
Frequent Patterns,
Association Rules,
…
What do people do with time series data?
Clustering, Classification, Motif Discovery, Rule Discovery (s = 0.5, c = 0.3), Query by Content, Visualization, Novelty Detection, Motif Association
What do people do with trajectory data?
Clustering, Frequent Travel Patterns, Motif Discovery, Prediction, Visualization, Classification
In Summary
Types of Data
- Transactional Data
- Sequence Data
- Interval Data
- Time Series Data
- Spatial Data
- Spatio-Temporal Data
- Data Set with Multiple Kinds of Data
- …
Data Mining Methods (Algorithms)
- Frequent Pattern Discovery
- Classification
- Clustering
- Outlier Detection
- Statistical Analysis
- …
Activity 1
Find the top 3 recent research activities around the world that are analyzing data. Write a short summary of each research activity. The first three lines must follow this format:
Line 1: The problem they are trying to solve, along with the dataset they are using
Line 2: How they are solving the problem
Line 3: Why you rate this work as a top 3 activity
Remaining lines: up to you.
Example: The BigN’Smart Research group at IIT-Roorkee is analyzing the “YelpReview” dataset for learning location-to-activity tagging. They are applying … . I feel this is interesting research because …
Activity 2: Why Data Mining?
Read about their stories:
Google
Facebook
Netflix
eHarmony
FICO
FlightCaster
IBM’s Watson
Related Fields
[Figure: Data Mining and Knowledge Discovery shown together with Machine Learning, Visualization, Statistics, and Databases]
Related Fields
Statistics:
- more theory-based
- more focused on testing hypotheses
Machine learning:
- more heuristic
- focused on improving performance of a learning agent
- also looks at real-time learning and robotics, areas not part of data mining
Data Mining and Knowledge Discovery:
- integrates theory and heuristics
- focuses on the entire process of knowledge discovery, including data cleaning, learning, and integration and visualization of results
Distinctions are fuzzy
Classification
Learn a method for predicting the instance class from pre-labeled (classified) instances
Many approaches: Statistics, Decision Trees, Neural Networks, ...
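To make "predicting the class from pre-labeled instances" concrete, here is a minimal sketch using a 1-nearest-neighbor rule. This technique is chosen only because it is short; it is not one of the approaches named on the slide, and the training data is invented.

```python
import math

# Pre-labeled (classified) instances: (feature vector, class label).
# Features and labels are invented for illustration.
training = [
    ((1.0, 1.0), "A"),
    ((1.5, 2.0), "A"),
    ((8.0, 8.0), "B"),
    ((9.0, 9.5), "B"),
]

def predict(instance):
    """1-nearest-neighbor: return the label of the closest training instance."""
    _, label = min(training, key=lambda pair: math.dist(pair[0], instance))
    return label

print(predict((2.0, 1.0)))  # A
print(predict((7.5, 9.0)))  # B
```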
Clustering
Find “natural” grouping of instances given unlabeled data
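One common way to find such groupings is k-means. The sketch below is a bare-bones version under simplifying assumptions: the number of groups and the starting centroids are fixed by hand, and the points are invented.

```python
import math

# Unlabeled instances (values invented for illustration).
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.0), (8.0, 9.5)]

def kmeans(points, centroids, iterations=10):
    """Plain k-means: alternate assignment and centroid update."""
    for _ in range(iterations):
        # Assign each point to its nearest centroid.
        labels = [min(range(len(centroids)), key=lambda k: math.dist(p, centroids[k]))
                  for p in points]
        # Move each centroid to the mean of its assigned points.
        new_centroids = []
        for k in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                new_centroids.append(tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                new_centroids.append(centroids[k])  # keep an empty cluster's centroid
        if new_centroids == centroids:
            break  # converged: centroids no longer move
        centroids = new_centroids
    return labels, centroids

labels, centroids = kmeans(points, centroids=[points[0], points[3]])
print(labels)  # [0, 0, 0, 1, 1, 1]
```

No class labels are given to the algorithm; the group numbers 0 and 1 are arbitrary names for the clusters it finds.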
Association Rules & Frequent Itemsets
Transactions:
TID  Produce
1    MILK, BREAD, EGGS
2    BREAD, SUGAR
3    BREAD, CEREAL
4    MILK, BREAD, SUGAR
5    MILK, CEREAL
6    BREAD, CEREAL
7    MILK, CEREAL
8    MILK, BREAD, CEREAL, EGGS
9    MILK, BREAD, CEREAL
Frequent Itemsets:
Milk, Bread (4)
Bread, Cereal (4)
Milk, Bread, Cereal (2)
…
Rules:
Milk => Bread (66%)
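The support counts and the rule confidence can be checked by brute force on the nine transactions. This is a sketch for a toy data set, not an efficient frequent-itemset algorithm; the minimum support of 3 and the function names are assumptions.

```python
from collections import Counter
from itertools import combinations

transactions = [
    {"MILK", "BREAD", "EGGS"},
    {"BREAD", "SUGAR"},
    {"BREAD", "CEREAL"},
    {"MILK", "BREAD", "SUGAR"},
    {"MILK", "CEREAL"},
    {"BREAD", "CEREAL"},
    {"MILK", "CEREAL"},
    {"MILK", "BREAD", "CEREAL", "EGGS"},
    {"MILK", "BREAD", "CEREAL"},
]

def support_count(itemset):
    """Number of transactions that contain every item of the itemset."""
    return sum(1 for t in transactions if set(itemset) <= t)

def confidence(antecedent, consequent):
    """Fraction of transactions with the antecedent that also contain the consequent."""
    return support_count(set(antecedent) | set(consequent)) / support_count(antecedent)

# Brute-force count of all item pairs; fine for a toy data set.
pair_counts = Counter(
    pair for t in transactions for pair in combinations(sorted(t), 2)
)
frequent_pairs = {pair: n for pair, n in pair_counts.items() if n >= 3}
print(frequent_pairs)
print(support_count({"MILK", "BREAD"}))           # 4
print(round(confidence({"MILK"}, {"BREAD"}), 2))  # 0.67
```

Milk appears in 6 transactions and Milk with Bread in 4, so the confidence of Milk => Bread is 4/6, about 66%.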
Visualization & Data Mining
Visualizing the data to facilitate human discovery
Presenting the discovered results in a visually "nice" way
Summarization
Describe features of the selected group
Use natural language and graphics
Usually in combination with deviation detection or other methods
Example: Average length of stay in this study area rose 45.7 percent, from 4.3 days to 6.2 days, because ...
Data Mining Models and Tasks
Obtained from Prof. Srini’s Lecture notes