Mining Frequent Patterns, Associations, and Correlations: Basic Concepts and Methods
Topics
Basic Concepts
Frequent Itemset Mining Methods
Which Patterns Are Interesting?—Pattern Evaluation Methods
Summary
What Is Frequent Pattern Analysis?
Frequent pattern: a pattern that occurs frequently in a data set (a set of items (a frequent itemset), a subsequence, a substructure, etc.)
Motivation: finding inherent regularities in data
Which products were often purchased together?— Beer and diapers?!
What are the subsequent purchases after buying a PC?
Can we automatically classify web documents?
Applications:
Basket data analysis, cross-marketing, sales campaign analysis, Web log (click stream) analysis
Why Is Frequent Pattern Mining Important?
Frequent patterns are an intrinsic and important property of datasets.
Foundation for many essential data mining tasks:
Association, correlation, and causality analysis
Sequential and structural (e.g., sub-graph) patterns
Pattern analysis in stream data
Classification: discriminative frequent pattern analysis
Broad applications
Problem Definition:
The problem of association rule mining is defined as follows:
Let I = {i1, i2, ..., in} be a set of items and D a set of transactions, where each transaction T is a subset of I.
An association rule is an implication of the form X ⇒ Y, where X ⊆ I, Y ⊆ I, and X ∩ Y = ∅.
The task is to find all rules X ⇒ Y whose support and confidence are no less than user-specified minimum thresholds (minsup and minconf).
Example 1:
The set of items is I = {milk, bread, butter, beer},
and a small database containing the items in transactions is shown in the table below
(1 codes presence and 0 absence of an item in a transaction).
An example rule for the supermarket could be {butter, bread} ⇒ {milk},
meaning that if butter and bread are bought, customers also buy milk.
Example:
Example database with 4 items and 5 transactions

Transaction   milk   bread   butter   beer
1             1      1       0        0
2             0      0       1        0
3             0      0       0        1
4             1      1       1        0
5             0      1       0        0
Example:
Important concepts of Association Rule Mining. Three measures:
SUPPORT
CONFIDENCE
LIFT
support(X) = (number of transactions containing X) / (total number of transactions)
The support s = Supp(X ∪ Y) of a rule X ⇒ Y is the probability that a transaction contains both X and Y (the combined sale).
In the example database above, find Supp(X ∪ Y) for the rule {butter, bread} ⇒ {milk}.
Solution:
The itemset X ∪ Y is {butter, bread, milk}, and
Supp(X ∪ Y) = 1/5 = 0.2,
i.e., it occurs in 20% of all transactions (1 out of 5 transactions).
Example:
The confidence c = conf(X ⇒ Y) of a rule indicates whether the products are popular through individual sales or through the combined sale:
conf(X ⇒ Y) = Supp(X ∪ Y) / Supp(X)
In the example database above, find conf(X ⇒ Y) for the rule {butter, bread} ⇒ {milk}.
Solution:
The itemset X ∪ Y is {butter, bread, milk}, so Supp(X ∪ Y) = 1/5 = 0.2.
The itemset X is {butter, bread}, so Supp(X) = 1/5 = 0.2,
since it occurs in 20% of all transactions (1 out of 5 transactions).
conf(X ⇒ Y) = 0.2 / 0.2 = 1 = 100%
Example:
Lift measures how much more often X and Y are sold together than expected if they were independent:
lift(X ⇒ Y) = Supp(X ∪ Y) / (Supp(X) × Supp(Y))
In the example database above, find lift(X ⇒ Y) for another rule, {milk, bread} ⇒ {butter}.
Solution:
The itemset X ∪ Y is {milk, bread, butter}, so Supp(X ∪ Y) = 1/5 = 0.2.
The itemset X is {milk, bread}, so Supp(X) = 2/5 = 0.4; the itemset Y is {butter}, so Supp(Y) = 2/5 = 0.4.
Therefore,
lift(X ⇒ Y) = 0.2 / (0.4 × 0.4) = 1.25
Example:
When the lift value is above 1 (as in this case), the two itemsets are bought together more often than would be expected if the individual items were sold independently.
When the lift value is below 1, the combination is bought together less often than expected, i.e., it is not frequently bought by consumers.
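To make the three measures concrete, here is a minimal Python sketch. The transaction list mirrors the example database above, and the function names are illustrative rather than taken from any library.

```python
# Minimal sketch: support, confidence, and lift over the 5-transaction example DB
transactions = [
    {"milk", "bread"},
    {"butter"},
    {"beer"},
    {"milk", "bread", "butter"},
    {"bread"},
]

def support(itemset):
    """Fraction of transactions containing every item in `itemset`."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(X, Y):
    """conf(X => Y) = Supp(X u Y) / Supp(X)."""
    return support(X | Y) / support(X)

def lift(X, Y):
    """lift(X => Y) = Supp(X u Y) / (Supp(X) * Supp(Y))."""
    return support(X | Y) / (support(X) * support(Y))

print(support({"butter", "bread", "milk"}))       # 0.2
print(confidence({"butter", "bread"}, {"milk"}))  # 1.0
print(lift({"milk", "bread"}, {"butter"}))        # 1.25
```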
Basic Concepts: Frequent Patterns
Ti Items bought support count of X:
d
10 Beer, Nuts, Diaper
Frequency or occurrence of an
20 Beer, Coffee, Diaper
itemset X
30 Beer, Diaper, Eggs
40 Nuts, Eggs, Milk (relative) support, s,
50 Nuts, Coffee, Diaper, Eggs, is the fraction of transactions
Milk
that contains X (i.e., the
Custom Customer probability that a transaction
er buys contains X)
buys diaper
both
An itemset X is frequent
if X’s support is no less than a
minsup threshold
Customer
buys beer
12
Example 2: Association Rules

Tid   Items bought
10    Beer, Nuts, Diaper
20    Beer, Coffee, Diaper
30    Beer, Diaper, Eggs
40    Nuts, Eggs, Milk
50    Nuts, Coffee, Diaper, Eggs, Milk

Find all the rules X ⇒ Y with minimum support and confidence.
Let minsup = 50%, minconf = 50%.
Frequent patterns: Beer:3, Nuts:3, Diaper:4, Eggs:3, {Beer, Diaper}:3
For the association rules shown below:
Beer ⇒ Diaper (60%, 100%)
Diaper ⇒ Beer (60%, 75%)
Find
supp(Beer ⇒ Diaper)
conf(Beer ⇒ Diaper)
supp(Diaper ⇒ Beer)
conf(Diaper ⇒ Beer)
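As a quick worked check of the two rules above, here is a small Python sketch; the ad hoc `supp` helper is illustrative, not a library function.

```python
# The five transactions from the table above
db = [
    {"Beer", "Nuts", "Diaper"},
    {"Beer", "Coffee", "Diaper"},
    {"Beer", "Diaper", "Eggs"},
    {"Nuts", "Eggs", "Milk"},
    {"Nuts", "Coffee", "Diaper", "Eggs", "Milk"},
]

def supp(itemset):
    """Relative support of an itemset in db."""
    return sum(itemset <= t for t in db) / len(db)

print(supp({"Beer", "Diaper"}))                     # 0.6  -> 60% support for both rules
print(supp({"Beer", "Diaper"}) / supp({"Beer"}))    # 1.0  -> conf(Beer => Diaper) = 100%
print(supp({"Beer", "Diaper"}) / supp({"Diaper"}))  # 0.75 -> conf(Diaper => Beer) = 75%
```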
Mining Frequent Patterns, Associations, and Correlations: Basic Concepts and Methods
Basic Concepts
Frequent Itemset Mining Methods
Which Patterns Are Interesting?—Pattern Evaluation Methods
Summary
Scalable Frequent Itemset Mining Methods
Apriori: A Candidate Generation-and-Test Approach
Improving the Efficiency of Apriori
FPGrowth: A Frequent Pattern-Growth Approach
The Downward Closure Property and Scalable Mining Methods
The downward closure property of frequent patterns:
Any subset of a frequent itemset must be frequent.
If {beer, diaper, nuts} is frequent, so is {beer, diaper},
i.e., every transaction having {beer, diaper, nuts} also contains {beer, diaper}.
Scalable mining methods:
Apriori (Agrawal & Srikant @ VLDB'94)
Frequent pattern growth (FPgrowth—Han, Pei & Yin @ SIGMOD'00)
Apriori: A Candidate Generation & Test Approach
Apriori pruning principle: if there is any itemset that is infrequent, its supersets should not be generated/tested!
(Agrawal & Srikant @ VLDB'94; Mannila et al. @ KDD'94)
Method:
Initially, scan the DB once to get the frequent 1-itemsets.
Generate length-(k+1) candidate itemsets from length-k frequent itemsets.
Test the candidates against the DB.
Terminate when no frequent or candidate set can be generated.
Apriori: A Candidate Generation & Test Approach
The name of the algorithm is based on the fact that the algorithm uses prior knowledge of frequent itemset properties.
Apriori employs an iterative approach known as a level-wise search, where k-itemsets are used to explore (k+1)-itemsets.
Apriori: A Candidate Generation & Test Approach
First, the set of frequent 1-itemsets is found by scanning the database to accumulate the count for each item and collecting those items that satisfy minimum support. The resulting set is denoted L1.
Next, L1 is used to find L2, the set of frequent 2-itemsets, which is used to find L3, and so on, until no more frequent k-itemsets can be found.
The finding of each Lk requires one full scan of the database.
A two-step process is followed in Apriori, consisting of join and prune actions.
The Apriori Algorithm (Pseudo-Code)
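Since the pseudo-code figure is not reproduced in the text, the following is a minimal Python sketch of the level-wise candidate generation-and-test loop it describes. The function name `apriori` and its arguments are illustrative assumptions, and `min_sup` is an absolute support count.

```python
from itertools import combinations

def apriori(transactions, min_sup):
    """Return {itemset (frozenset): support count} for all frequent itemsets."""
    transactions = [frozenset(t) for t in transactions]

    def frequent(candidates):
        # Scan the DB, count each candidate, keep those meeting min_sup
        counts = {c: 0 for c in candidates}
        for t in transactions:
            for c in candidates:
                if c <= t:
                    counts[c] += 1
        return {c: n for c, n in counts.items() if n >= min_sup}

    L = frequent({frozenset([i]) for t in transactions for i in t})  # L1
    result = dict(L)
    k = 2
    while L:
        prev = list(L)
        # Join step: combine frequent (k-1)-itemsets into k-itemset candidates
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent
        candidates = {c for c in candidates
                      if all(frozenset(s) in L for s in combinations(c, k - 1))}
        L = frequent(candidates)      # test the surviving candidates against the DB
        result.update(L)
        k += 1
    return result

# Example 2 database with an absolute support count of 3 (50% of 5 transactions, rounded up)
db = [{"Beer", "Nuts", "Diaper"}, {"Beer", "Coffee", "Diaper"},
      {"Beer", "Diaper", "Eggs"}, {"Nuts", "Eggs", "Milk"},
      {"Nuts", "Coffee", "Diaper", "Eggs", "Milk"}]
print(apriori(db, min_sup=3)[frozenset({"Beer", "Diaper"})])   # 3
```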
Example 3:
Example:
There are nine transactions in this database, that is, |D| = 9.
Steps:
1. In the first iteration of the algorithm, each item is a member of the set of candidate 1-itemsets, C1. The algorithm simply scans all of the transactions in order to count the number of occurrences of each item.
2. Suppose that the minimum support count required is 2, that is, min_sup = 2. The set of frequent 1-itemsets, L1, can then be determined. It consists of the candidate 1-itemsets satisfying minimum support. In our example, all of the candidates in C1 satisfy minimum support.
Example:
Steps:
3. To discover the set of frequent 2-itemsets, L2, the algorithm uses the join L1 ⋈ L1 to generate a candidate set of 2-itemsets, C2. No candidates are removed from C2 during the prune step because each subset of the candidates is also frequent.
4. Next, the transactions in D are scanned and the support count of each candidate itemset in C2 is accumulated.
5. The set of frequent 2-itemsets, L2, is then determined, consisting of those candidate 2-itemsets in C2 having minimum support.
Example:
Steps:
6. The set of candidate 3-itemsets, C3, is generated from the join step: C3 = L2 ⋈ L2 = {{I1, I2, I3}, {I1, I2, I5}, {I1, I3, I5}, {I2, I3, I4}, {I2, I3, I5}, {I2, I4, I5}}. Based on the Apriori property that all subsets of a frequent itemset must also be frequent, we can determine that the four latter candidates cannot possibly be frequent, so they are pruned.
7. The transactions in D are scanned in order to determine L3, consisting of those candidate 3-itemsets in C3 having minimum support.
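As a rough check, the `apriori` sketch shown after the pseudo-code slide can be run on this nine-transaction database (assuming the transaction table listed later in the FP-growth section, T100 through T900) with a minimum support count of 2; it reproduces the two frequent 3-itemsets found in step 7.

```python
# Reuses the apriori() sketch shown earlier
example3_db = [
    {"I1", "I2", "I5"},        # T100
    {"I2", "I4"},              # T200
    {"I2", "I3"},              # T300
    {"I1", "I2", "I4"},        # T400
    {"I2", "I3"},              # T500
    {"I2", "I3"},              # T600
    {"I2", "I3"},              # T700
    {"I1", "I2", "I3", "I5"},  # T800
    {"I1", "I2", "I3"},        # T900
]
L = apriori(example3_db, min_sup=2)
print(L[frozenset({"I1", "I2", "I3"})])  # 2
print(L[frozenset({"I1", "I2", "I5"})])  # 2
```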
Scalable Frequent Itemset Mining Methods
Apriori: A Candidate Generation-and-Test Approach
FPGrowth: A Frequent Pattern-Growth Approach
Pattern-Growth Approach: Mining Frequent Patterns Without Candidate Generation
The two primary drawbacks of the Apriori algorithm are:
1. At each step, candidate sets have to be built.
2. To build the candidate sets, the algorithm has to repeatedly scan the database.
These two properties inevitably make the algorithm slower.
To overcome these redundant steps, a new association-rule mining algorithm was developed, named the Frequent Pattern (FP) Growth algorithm.
It overcomes the disadvantages of the Apriori algorithm by storing all of the transactions in a tree data structure.
Pattern-Growth Approach: Mining Frequent Patterns Without Candidate Generation
Consider the following transaction data (a table of five transactions, T1 through T5, over items including K, E, M, O, and Y):
Pattern-Growth Approach: Mining Frequent Patterns Without Candidate Generation
First, the frequency of each individual item in the database is computed.
Pattern-Growth Approach: Mining Frequent Patterns Without Candidate Generation
Let the minimum support be 3.
A Frequent Pattern set L is built, containing all the items whose frequency is greater than or equal to the minimum support.
These items are stored in descending order of their respective frequencies.
After insertion of the relevant items, the set L looks like this:
L = {K : 5, E : 4, M : 3, O : 4, Y : 3}
Pattern-Growth Approach: Mining Frequent Patterns Without Candidate Generation
Now, for each transaction, the respective Ordered-Item set is built.
This is done by iterating over the Frequent Pattern set and checking whether the current item is contained in the transaction in question.
If the current item is contained, it is inserted into the Ordered-Item set for the current transaction.
The following Ordered-Item sets are built for the transactions (these are the sets inserted in steps a) through e) below):

T1   {K, E, M, O, Y}
T2   {K, E, O, Y}
T3   {K, E, M}
T4   {K, M, Y}
T5   {K, E, O}
Pattern-Growth Approach: Mining Frequent Patterns Without Candidate Generation
Now, all the Ordered-Item sets are inserted into a tree data structure.
a) Inserting the set {K, E, M, O, Y}:
Here, all the items are simply linked one after the other in the order in which they occur in the set, and the support count of each item's node is initialized to 1.
Pattern-Growth Approach: Mining Frequent Patterns Without Candidate Generation
Now, all the Ordered-Item sets are inserted into a tree data structure.
b) Inserting the set {K, E, O, Y}:
Up to the insertion of the elements K and E, the support counts of the existing nodes are simply increased by 1.
On inserting O, a new node for the item O is initialized with a support count of 1 and linked as a child of the node for E.
On inserting Y, we first initialize a new node for the item Y with a support count of 1 and link it as a child of the new node for O.
Pattern-Growth Approach: Mining Frequent Patterns Without Candidate Generation
Now, all the Ordered-Item sets are inserted into a tree data structure.
c) Inserting the set {K, E, M}:
Here, the support count of each element's node is simply increased by 1.
Pattern-Growth Approach: Mining Frequent Patterns Without Candidate Generation
Now, all the Ordered-Item sets are inserted into a tree data structure.
d) Inserting the set {K, M, Y}:
Similar to step b), first the support count of K is increased, then new nodes for M and Y are initialized and linked accordingly.
Pattern-Growth Approach: Mining Frequent Patterns Without Candidate Generation
Now, all the Ordered-Item sets are inserted into a tree data structure.
e) Inserting the set {K, E, O}:
Here, the support counts of the respective existing nodes are simply increased. Note that the support count of the new node of item O (created in step b)) is increased.
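The insertion procedure described in steps a) through e) can be sketched in a few lines of Python; the class and function names below are illustrative, not taken from the slides.

```python
class FPNode:
    """One tree node: an item, its support count, and its child nodes."""
    def __init__(self, item, parent=None):
        self.item = item
        self.count = 0
        self.parent = parent
        self.children = {}   # item -> FPNode

def insert(root, ordered_items):
    """Insert one Ordered-Item set, reusing shared prefixes as in steps a)-e)."""
    node = root
    for item in ordered_items:
        if item not in node.children:               # no matching child: create a new node
            node.children[item] = FPNode(item, node)
        node = node.children[item]
        node.count += 1                             # shared prefix: just bump the count

root = FPNode(None)
for t in [["K", "E", "M", "O", "Y"], ["K", "E", "O", "Y"],
          ["K", "E", "M"], ["K", "M", "Y"], ["K", "E", "O"]]:
    insert(root, t)
print(root.children["K"].count)                 # 5
print(root.children["K"].children["E"].count)   # 4
```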
Pattern-Growth Approach: Mining Frequent Patterns Without Candidate Generation
Now, for each item, the Conditional Pattern Base is computed: the path labels of all the paths which lead to any node of the given item in the frequent-pattern tree.
Note that the items in the table below are arranged in the ascending order of their frequencies.

Item   Conditional Pattern Base
Y      {{K, E, M, O} : 1, {K, E, O} : 1, {K, M} : 1}
O      {{K, E, M} : 1, {K, E} : 2}
M      {{K, E} : 2, {K} : 1}
E      {{K} : 4}
K      {}
Pattern-Growth Approach: Mining Frequent Patterns Without Candidate Generation
Now, for each item, the Conditional Frequent Pattern Tree is built.
It is done by taking the set of elements that is common to all the paths in the Conditional Pattern Base of that item and calculating its support count by summing the support counts of all the paths in the Conditional Pattern Base.

Item   Conditional Pattern Base                        Conditional Frequent Pattern Tree
Y      {{K, E, M, O} : 1, {K, E, O} : 1, {K, M} : 1}   {K : 3}
O      {{K, E, M} : 1, {K, E} : 2}                     {K, E : 3}
M      {{K, E} : 2, {K} : 1}                           {K : 3}
E      {{K} : 4}                                       {K : 4}
K      {}                                              {}
Pattern-Growth Approach: Mining Frequent Patterns Without Candidate Generation
From the Conditional Frequent Pattern Tree, the Frequent Pattern rules are generated by pairing the items of the Conditional Frequent Pattern Tree set with the corresponding item, as given in the table below.

Item   Frequent Patterns Generated
Y      {K, Y : 3}
O      {K, O : 3}, {E, O : 3}, {K, E, O : 3}
M      {K, M : 3}
E      {K, E : 4}

For each row, two types of association rules can be inferred;
for example, for the first row, the rules K -> Y and Y -> K can be inferred.
To determine the valid rule, the confidence of both rules is calculated, and the one with confidence greater than or equal to the minimum confidence value is retained
(here, conf(Y -> K) = 3/3 = 100%, while conf(K -> Y) = 3/5 = 60%).
Pattern-Growth Approach: Mining Frequent Patterns Without Candidate Generation
We re-examine the mining of the transaction database, D, of the table in Example 3 (modified) using the frequent pattern growth approach.
The first scan of the database is the same as in Apriori: it derives the set of frequent items (1-itemsets) and their support counts (frequencies).
Let the minimum support count be 2.
The set of frequent items is sorted in descending order of support count.
Pattern-Growth Approach: Mining Frequent Patterns Without Candidate Generation
TID List of Item IDs
T100 I2, I1, I5
T200 I2, I4
T300 I2, I3
T400 I2, I1, I4
T500 I2, I3
T600 I2, I3
T700 I2, I3
T800 I2, I1, I3, I5
T900 I2, I1, I3
Pattern-Growth Approach: Mining Frequent Patterns Without Candidate Generation
An FP-tree is then constructed as follows.
First, create the root of the tree, labeled with "null."
Scan database D a second time.
The items in each transaction are processed in L order (i.e., sorted according to descending support count), and a branch is created for each transaction.
For example, the scan of the first transaction, "T100: I1, I2, I5," which contains three items (I2, I1, I5 in L order), leads to the construction of the first branch of the tree with three nodes, ⟨I2: 1⟩, ⟨I1: 1⟩, and ⟨I5: 1⟩, where I2 is linked as a child of the root, I1 is linked to I2, and I5 is linked to I1.
Pattern-Growth Approach: Mining Frequent Patterns Without Candidate Generation
An FP-tree is then constructed as follows (cont'd):
The second transaction, T200, contains the items I2 and I4 in L order, which would result in a branch where I2 is linked to the root and I4 is linked to I2.
However, this branch would share a common prefix, I2, with the existing path for T100.
Therefore, we instead increment the count of the I2 node by 1, and create a new node, ⟨I4: 1⟩, which is linked as a child of ⟨I2: 2⟩.
In general, when considering the branch to be added for a transaction, the count of each node along a common prefix is incremented by 1, and nodes for the items following the prefix are created and linked accordingly.
To facilitate tree traversal, an item header table is built so that each item points to its occurrences in the tree via a chain of node-links.
Pattern-Growth Approach: Mining Frequent Patterns Without Candidate Generation
An FP-tree is then constructed as follows (cont'd):
The tree obtained after scanning all of the transactions is shown in the figure (below) with the associated node-links.
In this way, the problem of mining frequent patterns in databases is transformed into that of mining the FP-tree.
Pattern-Growth Approach: Mining Frequent Patterns Without Candidate Generation
Mining the FP-tree
The FP-tree is mined as follows.
Start from each frequent length-1 pattern (as an initial suffix pattern),
construct its conditional pattern base (a "sub-database," which consists of the set of prefix paths in the FP-tree co-occurring with the suffix pattern),
then construct its (conditional) FP-tree, and
perform mining recursively on such a tree.
The pattern growth is achieved by the concatenation of the suffix pattern with the frequent patterns generated from a conditional FP-tree.
Pattern-Growth Approach: Mining Frequent Patterns Without Candidate Generation
We first consider I5, which is the last item in L, rather than the first.
The reason for starting at the end of the list will become apparent as we explain the FP-tree mining process.
I5 occurs in two branches of the FP-tree of Figure 5.7.
(The occurrences of I5 can easily be found by following its chain of node-links.)
The paths formed by these branches are ⟨I2, I1, I5: 1⟩ and ⟨I2, I1, I3, I5: 1⟩.
Pattern-Growth Approach: Mining Frequent Patterns Without Candidate Generation
Therefore, considering I5 as a suffix, its corresponding two prefix paths are ⟨I2, I1: 1⟩ and ⟨I2, I1, I3: 1⟩, which form its conditional pattern base.
Its conditional FP-tree contains only a single path, ⟨I2: 2, I1: 2⟩; I3 is not included because its support count of 1 is less than the minimum support count.
The single path generates all the combinations of frequent patterns:
{I2, I5: 2},
{I1, I5: 2},
{I2, I1, I5: 2}.
Pattern-Growth Approach: Mining Frequent Patterns Without Candidate Generation
Bottlenecks of the Apriori approach:
Breadth-first (i.e., level-wise) search
Candidate generation and test, which often generates a huge number of candidates
The FPGrowth approach (J. Han, J. Pei, and Y. Yin, SIGMOD'00):
Depth-first search
Avoids explicit candidate generation
Major philosophy: grow long patterns from short ones using only locally frequent items.
If "abc" is a frequent pattern,
get all transactions having "abc", i.e., project the DB on abc: DB|abc;
if "d" is a local frequent item in DB|abc, then abcd is a frequent pattern.
Example 2:
Construct an FP-tree from the transaction database shown below; a summary of the mining of the FP-tree follows.

min_support = 3

TID   Items bought                 (Ordered) frequent items
100   {f, a, c, d, g, i, m, p}     {f, c, a, m, p}
200   {a, b, c, f, l, m, o}        {f, c, a, b, m}
300   {b, f, h, j, o, w}           {f, b}
400   {b, c, k, s, p}              {c, b, p}
500   {a, f, c, e, l, p, m, n}     {f, c, a, m, p}

Header Table
Item   frequency
f      4
c      4
a      3
b      3
m      3
p      3
Database and a summary of the mining of the FP-tree

min_support = 3, F-list = f-c-a-b-m-p

1. Scan the DB once and find the frequent 1-itemsets (single-item patterns).
2. Sort the frequent items in frequency-descending order to obtain the f-list.
3. Scan the DB again and construct the FP-tree:

{}
├── f:4
│   ├── c:3
│   │   └── a:3
│   │       ├── m:2
│   │       │   └── p:2
│   │       └── b:1
│   │           └── m:1
│   └── b:1
└── c:1
    └── b:1
        └── p:1

(The header table links each item f, c, a, b, m, p to its occurrences in the tree via node-links.)
Partition Patterns and Databases
Frequent patterns can be partitioned into subsets according to the f-list.
F-list = f-c-a-b-m-p
Patterns containing p
Patterns having m but no p
…
Patterns having c but none of a, b, m, p
Pattern f
This partitioning is complete and non-redundant.
Find Patterns Having p From p's Conditional Database
Starting at the frequent-item header table in the FP-tree:
Traverse the FP-tree by following the node-links of each frequent item p.
Accumulate all of the transformed prefix paths of item p to form p's conditional pattern base.

Conditional pattern bases:
item   conditional pattern base
c      f:3
a      fc:3
b      fca:1, f:1, c:1
m      fca:2, fcab:1
p      fcam:2, cb:1
From Conditional Pattern Bases to Conditional FP-trees
For each pattern base:
Accumulate the count for each item in the base.
Construct the FP-tree for the frequent items of the pattern base.

m-conditional pattern base: fca:2, fcab:1
m-conditional FP-tree: {} → f:3 → c:3 → a:3
(b is excluded because its count, 1, is below min_support = 3)
All frequent patterns relating to m:
m, fm, cm, am, fcm, fam, cam, fcam
The Frequent Pattern Growth Mining Method
Idea: frequent pattern growth
Recursively grow frequent patterns by pattern and database partition.
Method:
For each frequent item, construct its conditional pattern base, and then its conditional FP-tree.
Repeat the process on each newly created conditional FP-tree
until the resulting FP-tree is empty, or it contains only one path—a single path will generate all the combinations of its sub-paths, each of which is a frequent pattern.
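A compact Python sketch of this recursive pattern-growth idea is shown below. To keep it short, it represents each conditional pattern base directly as a list of (prefix path, count) pairs instead of materializing a conditional FP-tree; the names are illustrative. Run on the f-c-a-b-m-p example above, it reproduces the eight patterns relating to m.

```python
from collections import defaultdict

def fp_growth(pattern_base, min_sup, suffix=()):
    """Yield (pattern, support) pairs.

    pattern_base: list of (items, count) pairs with items in f-list order;
    for the initial call, pass each transaction with count 1.
    """
    counts = defaultdict(int)
    for items, cnt in pattern_base:        # count the locally frequent items
        for it in items:
            counts[it] += cnt
    for item, sup in counts.items():
        if sup < min_sup:
            continue
        pattern = suffix + (item,)
        yield pattern, sup
        # Conditional pattern base for `pattern`: the prefixes that precede `item`
        cond = [(items[:items.index(item)], cnt)
                for items, cnt in pattern_base
                if item in items and items.index(item) > 0]
        yield from fp_growth(cond, min_sup, pattern)

ordered_db = [(("f", "c", "a", "m", "p"), 1), (("f", "c", "a", "b", "m"), 1),
              (("f", "b"), 1), (("c", "b", "p"), 1), (("f", "c", "a", "m", "p"), 1)]
m_patterns = sorted("".join(sorted(p)) for p, s in fp_growth(ordered_db, 3) if "m" in p)
print(m_patterns)   # ['acfm', 'acm', 'afm', 'am', 'cfm', 'cm', 'fm', 'm']
```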
Mining Frequent Patterns, Associations, and Correlations: Basic Concepts and Methods
Basic Concepts
Frequent Itemset Mining Methods
Which Patterns Are Interesting?—Pattern Evaluation Methods
Summary
Interestingness Measure: Correlations (Lift)
We could easily end up with thousands or even millions of patterns, many of which might not be interesting.
It is therefore important to establish a set of well-accepted criteria for evaluating the quality of association patterns, to identify the most interesting ones.
Limitations of the Support-Confidence Framework
Example: Suppose we are interested in analyzing the relationship between people who drink tea and coffee. We may gather information about the beverage preferences among a group of people and summarize their responses into a table such as the one shown below (entries are percentages of the people surveyed).

          Coffee   No coffee   Total
Tea       15       5           20
No tea    65       15          80
Total     80       20          100
Interestingness Measure: Correlations (Lift)
The information given in this table can be used to evaluate the association rule {Tea} → {Coffee}.
At first glance, it may appear that people who drink tea also tend to drink coffee because the rule's support (15%) and confidence (75%) values are reasonably high.
Interestingness Measure: Correlations (Lift)
This argument would have been acceptable except that the fraction of people who drink coffee, regardless of whether they drink tea, is 80%, while the fraction of tea drinkers who drink coffee is only 75%.
Thus, knowing that a person is a tea drinker actually decreases her probability of being a coffee drinker from 80% to 75%!
The rule {Tea} → {Coffee} is therefore misleading despite its high confidence value.
One way to address this problem is by applying a metric known as lift:
lift(X ⇒ Y) = conf(X ⇒ Y) / Supp(Y) = Supp(X ∪ Y) / (Supp(X) × Supp(Y))
Here, lift({Tea} ⇒ {Coffee}) = 0.15 / (0.20 × 0.80) ≈ 0.94 < 1, confirming that tea drinking and coffee drinking are negatively correlated.
Interestingness Measure: Correlations (Lift)
Lift:
play basketball ⇒ eat cereal [40%, 66.7%] is misleading: the overall percentage of students eating cereal is 75% > 66.7%.
play basketball ⇒ not eat cereal [20%, 33.3%] is more accurate, although with lower support and confidence.
Measure of dependent/correlated events: lift

lift = P(A ∪ B) / (P(A) P(B))

            Basketball   Not basketball   Sum (row)
Cereal      2000         1750             3750
Not cereal  1000         250              1250
Sum (col.)  3000         2000             5000

lift(B, C) = (2000/5000) / ((3000/5000) × (3750/5000)) = 0.89
lift(B, ¬C) = (1000/5000) / ((3000/5000) × (1250/5000)) = 1.33
Interestingness Measure: Kulczynski
The Kulczynski measure of two itemsets A and B is the average of the two rule confidences:
Kulc(A, B) = (1/2) (P(A|B) + P(B|A)) = (1/2) (conf(A ⇒ B) + conf(B ⇒ A))
Its value ranges from 0 to 1; a value of 0.5 is neutral, and larger values indicate a stronger association between A and B.
Kulc is null-invariant: it is not influenced by the number of null transactions (transactions that contain neither A nor B).
Interestingness Measure: IR (Imbalance Ratio)
IR (Imbalance Ratio) measures the imbalance of two itemsets A and B in rule implications:
IR(A, B) = |sup(A) − sup(B)| / (sup(A) + sup(B) − sup(A ∪ B))
IR is 0 when the two directional implications between A and B are equally strong (balanced), and it approaches 1 as the itemsets become more imbalanced.
EXAMPLE:
Find IR(m, c) for D4, D5, and D6.
Interestingness Measure: IR (Imbalance Ratio)
EXAMPLE (Solution):
The Imbalance Ratio (IR) presents a clear picture for all three datasets D4 through D6:
D4 is balanced,
D5 is imbalanced, and
D6 is very imbalanced.
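Both measures are easy to compute from support counts. The following minimal sketch uses illustrative function names and hypothetical counts (the actual contingency tables for D4 through D6 are not reproduced in the text).

```python
def kulczynski(sup_a, sup_b, sup_ab):
    """Kulc(A, B) = (P(A|B) + P(B|A)) / 2, the average of the two confidences."""
    return 0.5 * (sup_ab / sup_a + sup_ab / sup_b)

def imbalance_ratio(sup_a, sup_b, sup_ab):
    """IR(A, B) = |sup(A) - sup(B)| / (sup(A) + sup(B) - sup(A u B))."""
    return abs(sup_a - sup_b) / (sup_a + sup_b - sup_ab)

# Hypothetical balanced case: A and B each occur 2000 times, 1000 times together
print(kulczynski(2000, 2000, 1000))       # 0.5 -> neutral
print(imbalance_ratio(2000, 2000, 1000))  # 0.0 -> perfectly balanced

# Hypothetical imbalanced case: A occurs 1100 times, B 11000 times, 1000 together
print(round(imbalance_ratio(1100, 11000, 1000), 2))   # 0.89 -> strongly imbalanced
```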