Mining Frequent Patterns
Asma Kanwal
Lecturer
What Is Frequent Pattern Analysis?
Frequent pattern: a pattern (a set of items, subsequences, substructures,
etc.) that occurs frequently in a data set
First proposed by Agrawal, Imielinski, and Swami [AIS93] in the context
of frequent itemsets and association rule mining
Motivation: Finding inherent regularities in data
What products were often purchased together?— Beer and diapers?!
What are the subsequent purchases after buying a PC?
What kinds of DNA are sensitive to this new drug?
Can we automatically classify web documents?
Applications
Basket data analysis, cross-marketing, catalog design, sale campaign
analysis, Web log (click stream) analysis, and DNA sequence analysis.
Why Is Freq. Pattern Mining Important?
Discloses an intrinsic and important property of data sets
Forms the foundation for many essential data mining tasks
Association, correlation, and causality analysis
Sequential, structural (e.g., sub-graph) patterns
Pattern analysis in spatiotemporal, multimedia, time-series, and stream data
Classification: associative classification
Cluster analysis: frequent pattern-based clustering
Data warehousing: iceberg cube and cube-gradient
Semantic data compression: fascicles
Broad applications
Association Rule Mining
Given a set of transactions, find rules that will predict the occurrence
of an item based on the occurrences of other items in the transaction
Market-Basket transactions:

TID  Items
1    Bread, Milk
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Beer, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke

Example of Association Rules:
{Diaper} → {Beer}
{Milk, Bread} → {Eggs, Coke}
{Beer, Bread} → {Milk}
Transaction data can be broadly interpreted I:
A set of documents
• A text document data set. Each document is treated as a "bag" of keywords. Note: text is ordered, but bags of words are not.

doc1: Student, Teach, School
doc2: Student, School
doc3: Teach, School, City, Game
doc4: Baseball, Basketball
doc5: Basketball, Player, Spectator
doc6: Baseball, Coach, Game, Team
doc7: Basketball, Team, City, Game

Example of Association Rules:
{Student} → {School}
{data} → {mining}
{Baseball} → {ball}
Transaction data can be broadly interpreted II:
A set of genes

ID  Expressed Genes in Sample
1   GENE1, GENE2, GENE5
2   GENE1, GENE3, GENE5
3   GENE2
4   GENE8, GENE9
5   GENE8, GENE9, GENE10
6   GENE2, GENE8
7   GENE9, GENE10
8   GENE2
9   GENE11

Example of Association Rules:
{GENE1} → {GENE12}
{GENE3, GENE12} → {GENE3}
Transaction data can be broadly interpreted III:
A set of time series patterns
[Figure: four time series, plotted over intervals 0–120 and 0–180, containing occurrences of patterns A, B, C, and D]
Example of Association Rules:
{A} → {B}
Use of Association Rules
Association rules do not represent any sort of causality or
correlation between the two itemsets.
X → Y does not mean X causes Y, so no causality
X → Y can be different from Y → X, unlike correlation
Association rule types:
Actionable Rules – contain high-quality, actionable information
Trivial Rules – information already well-known by those familiar with
the domain
Inexplicable Rules – no explanation and do not suggest action
Trivial and Inexplicable Rules occur most often
The Ideal Association Rule
Imagine that we have a large transaction dataset of patient
symptoms and interventions (including drugs taken).
We run our algorithm and it gives a rule that reads:
{warfarin, levofloxacin} → {nose bleeds}
Then we have automatically discovered a dangerous drug interaction. Both warfarin and levofloxacin are useful drugs by themselves, but together they are dangerous. Warning signs include patterns of bruises; signs of an active bleed include coughing up blood (hemoptysis), gingival bleeding, nose bleeds, ….
Intuitive Association Rules
In the music recommendation domain:
{purchased(beatles LP)} → {purchased(the kinks LP)}
These kinds of rules are very exploitable in ecommerce.
Definition: Frequent Itemset
Itemset
  A collection of one or more items
  Example: {Milk, Bread, Diaper}
k-itemset
  An itemset that contains k items
Support count (σ)
  Frequency of occurrence of an itemset
  E.g. σ({Milk, Bread, Diaper}) = 2
Support, s (range from 0 to 1)
  Fraction of transactions that contain an itemset
  E.g. s({Milk, Bread, Diaper}) = 2/5
Frequent Itemset
  An itemset whose support is greater than or equal to a minsup threshold

TID  Items
1    Bread, Milk
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Beer, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke
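The support count and support definitions above can be sketched in Python over the five-transaction table (a minimal, illustrative implementation; the function names are my own, not from the slides):

```python
# Support count (sigma) and support (s) over the market-basket table above.
# A minimal sketch; itemsets and transactions are plain Python sets.

transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def support_count(itemset, transactions):
    """sigma(X): number of transactions that contain every item of X."""
    return sum(1 for t in transactions if itemset <= t)

def support(itemset, transactions):
    """s(X): fraction of transactions that contain X (between 0 and 1)."""
    return support_count(itemset, transactions) / len(transactions)

X = {"Milk", "Bread", "Diaper"}
print(support_count(X, transactions))  # 2
print(support(X, transactions))        # 0.4
```

The subset test `itemset <= t` is exactly the "transaction contains the itemset" condition from the definition.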
Definition: Association Rule
• Association Rule
  – An implication expression of the form X → Y, where X and Y are itemsets*
  – Example: {Milk, Diaper} → {Beer}
• Important Note
  – Association rules do not consider order. So
    {Milk, Diaper} → {Beer} and {Diaper, Milk} → {Beer}
    are the same rule.
*X and Y are disjoint
Definition: Association Rule
• Association Rule
  – An implication expression of the form X → Y, where X and Y are itemsets*
  – Example: {Milk, Diaper} → {Beer}
• Rule Evaluation Metrics
  – Support (s)
    • Fraction of transactions that contain both X and Y
  – Confidence (c)
    • Measures how often items in Y appear in transactions that contain X

TID  Items
1    Bread, Milk
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Beer, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke

Example: {Milk, Diaper} → {Beer}
s = σ(Milk, Diaper, Beer) / |T| = 2/5 = 0.4
c = σ(Milk, Diaper, Beer) / σ(Milk, Diaper) = 2/3 ≈ 0.67

*X and Y are disjoint
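The two metrics for {Milk, Diaper} → {Beer} can be checked numerically (a sketch; `sigma` is a helper name of my own for the support count):

```python
# Support and confidence of {Milk, Diaper} -> {Beer} on the table above.

transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def sigma(itemset):
    """Support count: number of transactions containing the whole itemset."""
    return sum(1 for t in transactions if itemset <= t)

X, Y = {"Milk", "Diaper"}, {"Beer"}
s = sigma(X | Y) / len(transactions)  # fraction containing both X and Y
c = sigma(X | Y) / sigma(X)           # of those with X, how many also have Y
print(f"s = {s:.2f}, c = {c:.2f}")    # s = 0.40, c = 0.67
```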
Association Rules
• Why measure support?
– Very low support rules can happen by chance
– Even if true rules, low support rules are often not
actionable
• Why measure confidence?
– Very low confidence rules are not reliable
Association Rule Mining Task
Given a set of transactions T, the goal of association rule mining
is to find all rules having
support ≥ minsup threshold (provided by user)
confidence ≥ minconf threshold (provided by user)
Brute-force approach:
List all possible association rules
Compute the support and confidence for each rule
Prune rules that fail the minsup and minconf thresholds
Computationally prohibitive!
Mining Association Rules

TID  Items
1    Bread, Milk
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Beer, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke

Example of Rules:
{Milk, Diaper} → {Beer} (s=0.4, c=0.67)
{Milk, Beer} → {Diaper} (s=0.4, c=1.0)
{Diaper, Beer} → {Milk} (s=0.4, c=0.67)
{Beer} → {Milk, Diaper} (s=0.4, c=0.67)
{Diaper} → {Milk, Beer} (s=0.4, c=0.5)
{Milk} → {Diaper, Beer} (s=0.4, c=0.5)

Observations:
• All the above rules are binary partitions of the same itemset: {Milk, Diaper, Beer}
• Rules originating from the same itemset have identical support but can have different confidence
• Thus, we can decouple the support and confidence requirements
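The rules above can be reproduced by enumerating every binary partition of {Milk, Diaper, Beer} (a sketch; the helper names are mine):

```python
from itertools import combinations

# Every rule X -> Y with X ∪ Y = {Milk, Diaper, Beer} shares s = 0.4,
# but confidence varies with the antecedent X.
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def sigma(itemset):
    return sum(1 for t in transactions if itemset <= t)

itemset = {"Milk", "Diaper", "Beer"}
s = sigma(itemset) / len(transactions)  # identical for all six rules
for r in range(1, len(itemset)):
    for left in combinations(sorted(itemset), r):
        X = set(left)
        Y = itemset - X
        c = sigma(itemset) / sigma(X)
        print(f"{sorted(X)} -> {sorted(Y)}: s={s:.1f}, c={c:.2f}")
```

Printing six rules with one shared support makes the decoupling concrete: support depends only on the itemset, confidence on how it is split.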
Mining Association Rules
Two-step approach:
1. Frequent Itemset Generation
– Generate all itemsets whose support ≥ minsup
2. Rule Generation
– Generate high confidence rules from each frequent itemset,
where each rule is a binary partitioning of a frequent itemset
Frequent itemset generation is still computationally
expensive
The problem with association rules
How do we set support and confidence?
We tend to either find no rules, or a few million.
Given we find a few million, we can rank them using some ranking function…
There are lots of measures proposed in the literature.
Basic Concepts: Frequent Patterns and Association Rules

Transaction-id  Items bought
10  A, B, D
20  A, C, D
30  A, D, E
40  B, E, F
50  B, C, D, E, F

Itemset X = {x1, …, xk}
Find all the rules X → Y with minimum support and confidence
  support, s: probability that a transaction contains X ∪ Y
  confidence, c: conditional probability that a transaction having X also contains Y
Let supmin = 50%, confmin = 50%
Freq. Pat.: {A:3, B:3, D:4, E:3, AD:3}
Association rules:
  A → D (60%, 100%)
  D → A (60%, 75%)
Closed Patterns and Max-Patterns
A long pattern contains a combinatorial number of sub-patterns, e.g., {a1, …, a100} contains C(100,1) + C(100,2) + … + C(100,100) = 2^100 − 1 ≈ 1.27×10^30 sub-patterns!
Solution: Mine closed patterns and max-patterns instead
An itemset X is closed if X is frequent and there exists no super-pattern Y ⊃ X with the same support as X
An itemset X is a max-pattern if X is frequent and there exists no frequent super-pattern Y ⊃ X (proposed by Bayardo @ SIGMOD'98)
Closed patterns are a lossless compression of frequent patterns
  Reducing the # of patterns and rules
Closed Patterns and Max-Patterns
Exercise. DB = {<a1, …, a100>, <a1, …, a50>}
Min_sup = 1.
What is the set of closed itemsets?
  <a1, …, a100>: 1
  <a1, …, a50>: 2
What is the set of max-patterns?
  <a1, …, a100>: 1
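The exercise can be checked by brute force on a scaled-down version (4 items instead of 100, since 2^100 subsets cannot be enumerated; this is an illustrative sketch, not an efficient closed-pattern miner):

```python
from itertools import combinations

# Scaled-down exercise: DB = {<a1..a4>, <a1, a2>}, min_sup = 1.
db = [frozenset({"a1", "a2", "a3", "a4"}), frozenset({"a1", "a2"})]
min_sup = 1

def sup(X):
    return sum(1 for t in db if X <= t)

universe = sorted(set().union(*db))
frequent = [frozenset(c)
            for r in range(1, len(universe) + 1)
            for c in combinations(universe, r)
            if sup(frozenset(c)) >= min_sup]

# Closed: no proper superset with the same support.
closed = [X for X in frequent
          if not any(X < Y and sup(Y) == sup(X) for Y in frequent)]
# Maximal: no frequent proper superset at all.
maximal = [X for X in frequent if not any(X < Y for Y in frequent)]

print(sorted(map(sorted, closed)))   # [['a1', 'a2'], ['a1', 'a2', 'a3', 'a4']]
print(sorted(map(sorted, maximal)))  # [['a1', 'a2', 'a3', 'a4']]
```

Exactly as in the 100-item exercise, the two transactions themselves are the closed itemsets (supports 2 and 1), and only the longer one is maximal.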
Scalable Methods for Mining Frequent Patterns
The downward closure property of frequent patterns
Any subset of a frequent itemset must be frequent
If {beer, diaper, nuts} is frequent, so is {beer,
diaper}
i.e., every transaction having {beer, diaper, nuts} also
contains {beer, diaper}
Scalable mining methods: Three major approaches
Apriori
Freq. pattern growth
Vertical data format approach
Apriori: A Candidate Generation-and-Test Approach
Apriori pruning principle: If there is any itemset which is
infrequent, its superset should not be generated/tested!
Method:
Initially, scan DB once to get frequent 1-itemset
Generate length (k+1) candidate itemsets from length
k frequent itemsets
Test the candidates against DB
Terminate when no frequent or candidate set can be
generated
The Apriori Algorithm—An Example
Supmin = 2

Database TDB:
Tid  Items
10   A, C, D
20   B, C, E
30   A, B, C, E
40   B, E

1st scan → C1:
Itemset  sup
{A}      2
{B}      3
{C}      3
{D}      1
{E}      3

L1 ({D} pruned, sup < 2):
Itemset  sup
{A}      2
{B}      3
{C}      3
{E}      3

C2 (generated from L1): {A, B}, {A, C}, {A, E}, {B, C}, {B, E}, {C, E}

2nd scan → C2 with counts:
Itemset  sup
{A, B}   1
{A, C}   2
{A, E}   1
{B, C}   2
{B, E}   3
{C, E}   2

L2:
Itemset  sup
{A, C}   2
{B, C}   2
{B, E}   3
{C, E}   2

C3: {B, C, E}

3rd scan → L3:
Itemset    sup
{B, C, E}  2
The Apriori Algorithm
Pseudo-code:
  Ck: Candidate itemset of size k
  Lk: frequent itemset of size k
  L1 = {frequent items};
  for (k = 1; Lk != ∅; k++) do begin
      Ck+1 = candidates generated from Lk;
      for each transaction t in database do
          increment the count of all candidates in Ck+1
          that are contained in t
      Lk+1 = candidates in Ck+1 with min_support
  end
  return ∪k Lk;
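The pseudo-code above can be turned into a short runnable sketch (candidate generation here is a simple join of frequent k-itemsets; it reproduces the TDB example from the previous slide):

```python
from itertools import combinations

# Apriori on the 4-transaction TDB example, min_sup = 2.
tdb = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
min_sup = 2

def apriori(db, min_sup):
    items = sorted(set().union(*db))
    # L1: frequent 1-itemsets
    L = [frozenset([i]) for i in items
         if sum(1 for t in db if i in t) >= min_sup]
    all_frequent = list(L)
    k = 1
    while L:
        # generate (k+1)-candidates by joining pairs of frequent k-itemsets
        cands = {a | b for a in L for b in L if len(a | b) == k + 1}
        # Apriori pruning: every k-subset of a candidate must be frequent
        Lset = set(L)
        cands = {c for c in cands
                 if all(frozenset(s) in Lset for s in combinations(c, k))}
        # scan the database to count candidate supports
        L = [c for c in cands if sum(1 for t in db if c <= t) >= min_sup]
        all_frequent += L
        k += 1
    return all_frequent

frequent = apriori(tdb, min_sup)
for itemset in sorted(map(sorted, frequent)):
    print(itemset)
```

Running it yields the same L1, L2, and L3 = {B, C, E} as the worked example.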
Important Details of Apriori
How to generate candidates?
Step 1: self-joining Lk
Step 2: pruning
How to count supports of candidates?
Example of Candidate-generation
L3={abc, abd, acd, ace, bcd}
Self-joining: L3*L3
abcd from abc and abd
acde from acd and ace
Pruning:
acde is removed because ade is not in L3
C4={abcd}
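The example above can be reproduced in code, with itemsets encoded as sorted tuples (a sketch of the self-join and prune steps):

```python
from itertools import combinations

# Candidate generation from L3 = {abc, abd, acd, ace, bcd}.
L3 = {("a", "b", "c"), ("a", "b", "d"), ("a", "c", "d"),
      ("a", "c", "e"), ("b", "c", "d")}

def gen_candidates(Lk):
    k = len(next(iter(Lk)))
    # Step 1: self-join -- merge pairs that agree on the first k-1 items
    joined = {p[:-1] + (p[-1], q[-1])
              for p in Lk for q in Lk
              if p[:-1] == q[:-1] and p[-1] < q[-1]}
    # Step 2: prune -- drop candidates that have an infrequent k-subset
    return {c for c in joined if all(s in Lk for s in combinations(c, k))}

print(gen_candidates(L3))  # {('a', 'b', 'c', 'd')}  (acde pruned: ade not in L3)
```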
How to Generate Candidates?
Suppose the items in Lk-1 are listed in an order
Step 1: self-joining Lk-1
insert into Ck
select p.item1, p.item2, …, p.itemk-1, q.itemk-1
from Lk-1 p, Lk-1 q
where p.item1=q.item1, …, p.itemk-2=q.itemk-2, p.itemk-1 < q.itemk-1
Step 2: pruning
forall itemsets c in Ck do
forall (k-1)-subsets s of c do
if (s is not in Lk-1) then delete c from Ck
How to Count Supports of Candidates?
Why is counting supports of candidates a problem?
The total number of candidates can be very huge
One transaction may contain many candidates
Method:
Candidate itemsets are stored in a hash-tree
Leaf node of hash-tree contains a list of itemsets and
counts
Interior node contains a hash table
Subset function: finds all the candidates contained in a
transaction
Example: Counting Supports of Candidates
[Figure: a hash tree over candidate 3-itemsets, hashing on item values 1,4,7 / 2,5,8 / 3,6,9 at each interior node. The transaction {1, 2, 3, 5, 6} is recursively split (1+2356, 12+356, 13+56, …) so that only the leaves that could contain its subsets are visited. Leaf nodes hold candidates such as {1,4,5}, {1,2,4}, {4,5,7}, {1,2,5}, {4,5,8}, {1,5,9}, {1,3,6}, {2,3,4}, {5,6,7}, {3,4,5}, {3,5,6}, {3,6,7}, {3,6,8}, {3,5,7}, {6,8,9}.]
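The end result of the subset function can be sketched without building the tree: enumerate the transaction's 3-subsets and intersect with the candidate set (the candidate itemsets below are the leaf entries from the figure; the hash-tree's only role is to avoid enumerating all subsets):

```python
from itertools import combinations

# Candidate 3-itemsets, as stored in the hash-tree leaves of the figure.
candidates = {frozenset(c) for c in
              [(1, 4, 5), (1, 2, 4), (4, 5, 7), (1, 2, 5), (4, 5, 8),
               (1, 5, 9), (1, 3, 6), (2, 3, 4), (5, 6, 7), (3, 4, 5),
               (3, 5, 6), (3, 6, 7), (3, 6, 8), (3, 5, 7), (6, 8, 9)]}
transaction = {1, 2, 3, 5, 6}

# Which candidates are contained in the transaction?
contained = sorted(tuple(sorted(s))
                   for s in combinations(sorted(transaction), 3)
                   if frozenset(s) in candidates)
print(contained)  # [(1, 2, 5), (1, 3, 6), (3, 5, 6)]
```

Only these three candidates get their counts incremented for this transaction.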
Challenges of Frequent Pattern Mining
Challenges
Multiple scans of transaction database
Huge number of candidates
Tedious workload of support counting for
candidates
Improving Apriori: general ideas
Reduce passes of transaction database scans
Shrink number of candidates
Facilitate support counting of candidates
Partition: Scan Database Only Twice
Any itemset that is potentially frequent in DB must be
frequent in at least one of the partitions of DB
Scan 1: partition database and find local frequent
patterns
Scan 2: consolidate global frequent patterns
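A sketch of the two-scan idea over the five-transaction table (brute-force local mining stands in for running Apriori per partition; the 60% relative threshold and the 2/3 split are my choices for illustration):

```python
from itertools import combinations

# Partition-based mining: locally frequent itemsets from each partition
# form the global candidates, which one more full scan then verifies.
db = [{"Bread", "Milk"}, {"Bread", "Diaper", "Beer", "Eggs"},
      {"Milk", "Diaper", "Beer", "Coke"}, {"Bread", "Milk", "Diaper", "Beer"},
      {"Bread", "Milk", "Diaper", "Coke"}]
min_ratio = 0.6  # relative minimum support

def frequent_itemsets(part, ratio):
    """Brute-force local mining (stand-in for Apriori on a partition)."""
    items = sorted(set().union(*part))
    return {frozenset(c)
            for r in range(1, len(items) + 1)
            for c in combinations(items, r)
            if sum(1 for t in part if set(c) <= t) >= ratio * len(part)}

# Scan 1: mine each partition locally, union the results as candidates.
partitions = [db[:2], db[2:]]
candidates = set().union(*(frequent_itemsets(p, min_ratio) for p in partitions))

# Scan 2: count the candidates once over the full database.
globally_frequent = {c for c in candidates
                     if sum(1 for t in db if c <= t) >= min_ratio * len(db)}
print(len(globally_frequent), "globally frequent itemsets")  # prints: 8
```

No globally frequent itemset can be missed: an itemset infrequent in every partition is infrequent overall, which is exactly the property the slide states.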
Reduce the Number of Candidates
A k-itemset whose corresponding hashing bucket count is
below the threshold cannot be frequent
Candidates: a, b, c, d, e
Hash entries: {ab, ad, ae} {bd, be, de} …
Frequent 1-itemset: a, b, d, e
ab is not a candidate 2-itemset if the sum of count of
{ab, ad, ae} is below support threshold
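The bucket-count idea can be sketched as follows (a DHP-style illustration; the bucket hash function is my own, not the one from the literature):

```python
from itertools import combinations

# While scanning for 1-itemsets, hash every pair in each transaction into
# a small table of bucket counts.  A pair can only become a candidate
# 2-itemset if its whole bucket total reaches min_sup.
db = [{"a", "c", "d"}, {"b", "c", "e"}, {"a", "b", "c", "e"}, {"b", "e"}]
min_sup = 2
n_buckets = 7

def bucket(pair):
    # illustrative, deterministic hash function
    return sum(ord(ch) for item in pair for ch in item) % n_buckets

counts = [0] * n_buckets
for t in db:
    for pair in combinations(sorted(t), 2):
        counts[bucket(pair)] += 1

# A pair's true support never exceeds its bucket count, so pruning by
# bucket count can discard pairs early but never loses a frequent pair.
survivors = {pair for t in db for pair in combinations(sorted(t), 2)
             if counts[bucket(pair)] >= min_sup}
print(len(survivors), "candidate pairs kept")  # prints: 6
```

Here 8 distinct pairs occur in the data; hashing prunes two of them ({a,b} and {a,e}) before any candidate counting, while all four truly frequent pairs survive.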
Sampling for Frequent Patterns
Select a sample of original database, mine frequent
patterns within sample using Apriori
Scan database once to verify frequent itemsets found in
sample, only borders of closure of frequent patterns are
checked
Example: check abcd instead of ab, ac, …, etc.
Scan database again to find missed frequent patterns
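The sampling approach can be sketched as follows (brute-force mining stands in for Apriori; the thresholds, sample size, and the lowered in-sample threshold are illustrative choices of mine):

```python
import random
from itertools import combinations

db = [{"Bread", "Milk"}, {"Bread", "Diaper", "Beer", "Eggs"},
      {"Milk", "Diaper", "Beer", "Coke"}, {"Bread", "Milk", "Diaper", "Beer"},
      {"Bread", "Milk", "Diaper", "Coke"}]
min_ratio = 0.6

def mine(trans, ratio):
    """Brute-force frequent-itemset mining (stand-in for Apriori)."""
    items = sorted(set().union(*trans))
    return {frozenset(c)
            for r in range(1, len(items) + 1)
            for c in combinations(items, r)
            if sum(1 for t in trans if set(c) <= t) >= ratio * len(trans)}

# Step 1: mine a sample with a slightly lowered threshold (fewer misses).
random.seed(1)
sample = random.sample(db, 3)
in_sample = mine(sample, min_ratio - 0.1)

# Step 2: one scan of the full database verifies the sampled itemsets;
# a further scan would be needed to catch itemsets the sample missed.
verified = {x for x in in_sample
            if sum(1 for t in db if x <= t) >= min_ratio * len(db)}
print(len(verified), "verified frequent itemsets")
```

The verification scan guarantees no false positives; the extra scan mentioned on the slide handles the opposite risk, itemsets frequent overall but rare in the sample.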