Introduction to Data Mining
Madava Viranjan
• The world is rich in data
• Repositories to store data from multiple heterogeneous data sources
• OLAP as an analysis technique, with functionalities such as summarization,
consolidation and aggregation
What is Data Mining?
• The process of discovering interesting patterns and knowledge from large
amounts of data
• Is it the same as Knowledge Discovery from Data (KDD)?
KDD vs Data Mining
Data Mining Functionalities
• Class/Concept Description
• Classes and Concepts can be described in summarized terms
• Mining Frequent Patterns
• Patterns that occur frequently in a dataset
• Classification
• Find a model that describes and distinguishes classes/concepts
• Cluster Analysis
• Objects are grouped to maximize intra-class similarity and minimize
inter-class similarity
• Are all patterns interesting?
• Can a data mining system generate all of the interesting patterns?
• Can a data mining system generate only the interesting patterns?
It is a Combination of Subjects
Mining Frequent Patterns
Frequent Patterns
• Frequent patterns are patterns that appear frequently in a data set. They can
be frequent itemsets, frequent sequences or frequent substructures.
• Mining frequent patterns leads to the discovery of interesting associations and
correlations in data
Frequent Itemset Mining
• Market Basket Analysis
• Typical example of frequent itemset mining
Mining Frequent Itemsets – Apriori Algorithm
• It uses prior knowledge of frequent itemsets to find the frequent itemsets
level by level: frequent k-itemsets are used to explore (k+1)-itemsets.
• Apriori property
• All non-empty subsets of a frequent itemset must also be frequent
• Minimum Support Threshold
• An itemset is frequent only if its support count satisfies the minimum support
Mining Frequent Itemsets – Apriori Algorithm Contd.
TID List of item_id
T1 i1, i2, i5
T2 i2, i4
T3 i2, i3
T4 i1, i2, i4
T5 i1, i3
T6 i2, i3
T7 i1, i3
T8 i1, i2, i3, i5
T9 i1, i2, i3
Minimum Support = 2
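As a small illustration of what the minimum support means for this table, the sketch below counts how many transactions contain each item and keeps the items that reach the threshold. The variable names are chosen here for illustration and are not part of the slides.

```python
from collections import Counter

# Transactions copied from the table above.
transactions = [
    {"i1", "i2", "i5"},        # T1
    {"i2", "i4"},              # T2
    {"i2", "i3"},              # T3
    {"i1", "i2", "i4"},        # T4
    {"i1", "i3"},              # T5
    {"i2", "i3"},              # T6
    {"i1", "i3"},              # T7
    {"i1", "i2", "i3", "i5"},  # T8
    {"i1", "i2", "i3"},        # T9
]
MIN_SUPPORT = 2  # absolute support count, as stated above

# Support count of each single item (the candidate 1-itemsets).
support_counts = Counter(item for t in transactions for item in t)

# Items whose support count reaches the threshold (the frequent 1-itemsets).
frequent_items = {i: n for i, n in support_counts.items() if n >= MIN_SUPPORT}

for item in sorted(frequent_items):
    print(item, frequent_items[item])
```

With a threshold of 2, every item in this table is frequent; the least common items, i4 and i5, each appear in exactly two transactions.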
Mining Frequent Itemsets – Apriori Algorithm Contd.
TID  Computer  Webcam  Antivirus Software  Office Suite  SDCard
T1   1         1       1                   0             0
T2   0         1       1                   1             0
T3   0         0       0                   1             1
T4   1         1       0                   1             0
T5   1         1       1                   0             1
T6   1         1       1                   1             1
Minimum Support = 50%
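Here the minimum support is given as a percentage rather than a count. The sketch below turns the 0/1 table into item sets and computes the relative support of each single item. It is only one possible way to do this, and the names `columns`, `rows` and `support` are illustrative.

```python
# Column names and 0/1 rows copied from the table above.
columns = ["Computer", "Webcam", "Antivirus Software", "Office Suite", "SDCard"]
rows = {
    "T1": [1, 1, 1, 0, 0],
    "T2": [0, 1, 1, 1, 0],
    "T3": [0, 0, 0, 1, 1],
    "T4": [1, 1, 0, 1, 0],
    "T5": [1, 1, 1, 0, 1],
    "T6": [1, 1, 1, 1, 1],
}
MIN_SUPPORT = 0.5  # 50% of all transactions

# Each transaction as the set of items whose flag is 1.
transactions = {
    tid: {col for col, flag in zip(columns, flags) if flag}
    for tid, flags in rows.items()
}

# Relative support = fraction of transactions that contain the item.
n = len(rows)
support = {
    col: sum(flags[i] for flags in rows.values()) / n
    for i, col in enumerate(columns)
}

frequent = [col for col in columns if support[col] >= MIN_SUPPORT]
print(frequent)
```

With six transactions, 50% corresponds to a support count of 3, which SDCard meets exactly.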
Mining Frequent Itemsets – Apriori Algorithm Contd.
• Step 1: create the candidate 1-itemsets, C1
• Step 2: apply min_support to C1 to get the frequent 1-itemsets, L1
• Step 3: join L1 with itself to create the candidate 2-itemsets, C2
• Step 4: apply min_support to C2 to get the frequent 2-itemsets, L2
• Step 5: join L2 with itself to create the candidate 3-itemsets, C3. Remove
itemsets that do not satisfy the Apriori property.
• Step 6: apply min_support to C3 to get the frequent 3-itemsets, L3
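The steps above can be sketched as a loop over k. This is a minimal, unoptimized illustration rather than a reference implementation: the function name `apriori`, the use of `frozenset`, and the way the join is written (union of any two frequent (k-1)-itemsets that gives a k-itemset) are choices made here. It is run on the nine-transaction example with a minimum support count of 2.

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Return {itemset: support_count} for every frequent itemset."""
    def count(candidates):
        # One scan of the database per level.
        return {c: sum(1 for t in transactions if c <= t) for c in candidates}

    # Steps 1-2: candidate 1-itemsets, then the frequent ones.
    singles = {frozenset([item]) for t in transactions for item in t}
    current = {c: n for c, n in count(singles).items() if n >= min_support}
    frequent = dict(current)

    k = 2
    while current:
        prev = set(current)
        # Join: combine frequent (k-1)-itemsets into candidate k-itemsets.
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Prune (Apriori property): all (k-1)-subsets must be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in prev for s in combinations(c, k - 1))
        }
        # Keep the candidates that meet the minimum support.
        current = {c: n for c, n in count(candidates).items() if n >= min_support}
        frequent.update(current)
        k += 1
    return frequent

transactions = [
    {"i1", "i2", "i5"}, {"i2", "i4"}, {"i2", "i3"},
    {"i1", "i2", "i4"}, {"i1", "i3"}, {"i2", "i3"},
    {"i1", "i3"}, {"i1", "i2", "i3", "i5"}, {"i1", "i2", "i3"},
]
result = apriori(transactions, min_support=2)
for itemset, n in sorted(result.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), n)
```

On this data the loop stops after the 3-itemsets: the only 4-item candidate, {i1, i2, i3, i5}, is pruned because its subset {i1, i3, i5} is not frequent.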
Mining Frequent Itemsets – Apriori Algorithm Contd.
• How to compute confidence?
{i1, i2} => i5
{i1, i5} => i2
{i2, i5} => i1
i1 => {i2, i5}
i2 => {i1, i5}
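The confidence of a rule A => B is the support count of A ∪ B divided by the support count of A. The sketch below applies that definition to the rules listed above, using the nine-transaction example. The helper names `support_count` and `confidence` are illustrative.

```python
transactions = [
    {"i1", "i2", "i5"}, {"i2", "i4"}, {"i2", "i3"},
    {"i1", "i2", "i4"}, {"i1", "i3"}, {"i2", "i3"},
    {"i1", "i3"}, {"i1", "i2", "i3", "i5"}, {"i1", "i2", "i3"},
]

def support_count(itemset, transactions):
    """Number of transactions that contain every item in `itemset`."""
    return sum(1 for t in transactions if itemset <= t)

def confidence(antecedent, consequent, transactions):
    """confidence(A => B) = support_count(A ∪ B) / support_count(A)."""
    both = support_count(antecedent | consequent, transactions)
    return both / support_count(antecedent, transactions)

rules = [
    ({"i1", "i2"}, {"i5"}),
    ({"i1", "i5"}, {"i2"}),
    ({"i2", "i5"}, {"i1"}),
    ({"i1"}, {"i2", "i5"}),
    ({"i2"}, {"i1", "i5"}),
]
for antecedent, consequent in rules:
    c = confidence(antecedent, consequent, transactions)
    print(f"{sorted(antecedent)} => {sorted(consequent)}: {c:.0%}")
```

All five rules share the same support count (2, for {i1, i2, i5}), so the confidence differs only through the support count of the antecedent: 4, 2, 2, 6 and 7 respectively.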
Problems of Apriori Mining
• Needs to generate a huge number of candidate sets
• Needs to scan the whole database repeatedly
Mining Frequent Itemsets – A Pattern Growth Approach
• Divide and conquer approach
• Create a Frequent Pattern tree (FP-Tree)
TID List of item_id
T1 i1, i2, i5
T2 i2, i4
T3 i2, i3
T4 i1, i2, i4
T5 i1, i3
T6 i2, i3
T7 i1, i3
T8 i1, i2, i3, i5
T9 i1, i2, i3
Mining Frequent Itemsets – A Pattern Growth Approach Contd.
Step 1: Derive the frequent 1-itemsets and their support counts (as in Apriori)
Step 2: Create the list ‘L’ by sorting the frequent 1-itemsets in descending order of support count
Step 3: Create the root of the FP-tree and label it ‘null’
Step 4: Scan the database again and, for each transaction, add a branch
whose items follow the same order as ‘L’
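The four steps can be sketched as follows, using the nine-transaction example with a minimum support count of 2. This is a minimal illustration that builds only the tree (no header table or node links). The class `FPNode`, the function `build_fp_tree`, and the rule of breaking ties in support count by item name are assumptions made here.

```python
from collections import Counter

class FPNode:
    """One node of the FP-tree: an item, its count, and its children."""
    def __init__(self, item, parent=None):
        self.item = item
        self.count = 0
        self.parent = parent
        self.children = {}

def build_fp_tree(transactions, min_support):
    # Step 1: first scan, support count of every item; keep the frequent ones.
    counts = Counter(item for t in transactions for item in t)
    frequent = {i: n for i, n in counts.items() if n >= min_support}
    # Step 2: list L in descending support order (ties by name: an assumption).
    order = sorted(frequent, key=lambda i: (-frequent[i], i))
    rank = {item: pos for pos, item in enumerate(order)}
    # Step 3: the root, labelled 'null' (represented by None here).
    root = FPNode(None)
    # Step 4: second scan, insert each transaction as a path in L order.
    for t in transactions:
        node = root
        for item in sorted((i for i in t if i in rank), key=rank.__getitem__):
            child = node.children.get(item)
            if child is None:
                child = node.children[item] = FPNode(item, node)
            child.count += 1  # shared prefixes just increase the count
            node = child
    return root, order

def show(node, depth=0):
    """Print the tree, one node per line, indented by depth."""
    for child in node.children.values():
        print("  " * depth + f"{child.item}:{child.count}")
        show(child, depth + 1)

transactions = [
    {"i1", "i2", "i5"}, {"i2", "i4"}, {"i2", "i3"},
    {"i1", "i2", "i4"}, {"i1", "i3"}, {"i2", "i3"},
    {"i1", "i3"}, {"i1", "i2", "i3", "i5"}, {"i1", "i2", "i3"},
]
root, order = build_fp_tree(transactions, min_support=2)
print(order)
show(root)
```

Because transactions that share a prefix in L order share a path, the nine transactions collapse into a tree with only two branches under the root.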
Mining Frequent Itemsets – A Pattern Growth Approach Contd.
• When mining, start from each length-1 pattern and construct its conditional
pattern base. Then construct its conditional FP-tree and mine that tree
recursively.
TID Items
1 {a, b}
2 {b, c, d}
3 {a, c, d, e}
4 {a, d, e}
5 {a, b, c}
6 {a, b, c, d}
7 {a}
8 {a, b, c}
9 {a, b, d}
10 {b, c, e}
Minimum Support = 2
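To make the idea of a conditional pattern base concrete for this table, the sketch below computes it for item e. As a simplification (an assumption of this sketch, not the procedure on the slides), it reads the prefix paths directly from the transactions sorted in ‘L’ order instead of walking an FP-tree. The resulting paths and counts are the same ones the tree would give.

```python
from collections import Counter

# Transactions copied from the table above.
transactions = [
    {"a", "b"}, {"b", "c", "d"}, {"a", "c", "d", "e"}, {"a", "d", "e"},
    {"a", "b", "c"}, {"a", "b", "c", "d"}, {"a"}, {"a", "b", "c"},
    {"a", "b", "d"}, {"b", "c", "e"},
]
MIN_SUPPORT = 2

# List L: frequent items in descending support order (ties by name).
counts = Counter(item for t in transactions for item in t)
order = sorted(
    (i for i in counts if counts[i] >= MIN_SUPPORT),
    key=lambda i: (-counts[i], i),
)

def conditional_pattern_base(item):
    """Prefix paths that precede `item` in L order, with their counts."""
    base = Counter()
    for t in transactions:
        if item in t:
            ordered = [i for i in order if i in t]
            prefix = tuple(ordered[:ordered.index(item)])
            if prefix:
                base[prefix] += 1
    return base

base_e = conditional_pattern_base("e")

# Items frequent within the base: these form the conditional FP-tree of e.
cond_counts = Counter()
for prefix, n in base_e.items():
    for i in prefix:
        cond_counts[i] += n
cond_frequent = {i: n for i, n in cond_counts.items() if n >= MIN_SUPPORT}

print(dict(base_e))
print(cond_frequent)
```

Item b appears only once in the conditional pattern base of e, so it is dropped; the conditional FP-tree of e is built from a, c and d only, and mining continues recursively on that smaller tree.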
• Association rules can be misleading
Total number of transactions = 10000
Buys computer games = 6000
Buys videos = 7500
Buys both = 4000
Min_sup = 30%
Min_confidence = 60%
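The arithmetic behind this example can be checked with a few lines. The sketch evaluates the rule "buys computer games => buys videos" against the two thresholds and then compares its confidence with the overall share of customers who buy videos. Variable names are illustrative.

```python
# Figures copied from the slide above.
total = 10_000
games = 6_000
videos = 7_500
both = 4_000
min_sup = 0.30
min_conf = 0.60

# Rule: buys computer games => buys videos
support = both / total     # fraction of all transactions with both items
confidence = both / games  # fraction of game buyers who also buy videos
p_videos = videos / total  # fraction of all customers who buy videos

rule_is_strong = support >= min_sup and confidence >= min_conf
misleading = confidence < p_videos

print(f"support = {support:.1%}, confidence = {confidence:.1%}")
print(f"P(videos) = {p_videos:.1%}")
print(f"strong rule: {rule_is_strong}, misleading: {misleading}")
```

The rule passes both thresholds (support 40%, confidence about 66.7%), yet 75% of all customers buy videos anyway. Buying computer games therefore goes with a lower, not higher, chance of buying videos, which is why the rule is misleading.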
Correlation Analysis
• In addition to support and confidence, the correlation between the
itemsets is also measured.
Correlation Analysis with Lift Measure
• Lift is a measure used in correlation analysis
• If the result is less than 1, A is negatively correlated with B; if it is greater than 1, A and B are positively correlated; if it equals 1, they are independent
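Lift is defined as lift(A, B) = P(A ∪ B) / (P(A) · P(B)), where P(A ∪ B) is the probability that a transaction contains both A and B. As a minimal sketch, the function below computes it from raw counts and is applied to the computer games and videos figures given earlier. The function names `lift` and `interpret` are illustrative.

```python
def lift(count_a, count_b, count_both, total):
    """lift(A, B) = P(A and B together) / (P(A) * P(B))."""
    p_a = count_a / total
    p_b = count_b / total
    p_both = count_both / total
    return p_both / (p_a * p_b)

def interpret(value, tol=1e-9):
    """Map a lift value to the usual reading of the measure."""
    if abs(value - 1.0) <= tol:
        return "independent"
    return "positively correlated" if value > 1.0 else "negatively correlated"

# A = buys computer games, B = buys videos (figures from the earlier example)
lift_value = lift(count_a=6_000, count_b=7_500, count_both=4_000, total=10_000)
interpretation = interpret(lift_value)

print(f"lift = {lift_value:.3f} ({interpretation})")
```

For these figures the lift is 0.40 / (0.60 × 0.75), which is about 0.89, so the two purchases are negatively correlated even though the rule satisfies the support and confidence thresholds.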