Unit-6
Association Rule Mining
Introduction
Many business enterprises accumulate large quantities of
data from their day-to-day operations.
For example, Grocery stores/Retail stores
Market basket transactions
TID Items
1 Bread, Milk
2 Bread, Diaper, Beer, Eggs
3 Milk, Diaper, Beer, Coke
4 Bread, Milk, Diaper, Beer
5 Bread, Milk, Diaper, Coke
Introduction
Data required to learn about the purchasing behavior of
customers.
Useful for marketing promotions, inventory management,
and customer relationship management.
Association analysis is useful for discovering interesting
relationships hidden in large data sets.
Relationships are represented as association rules or sets
of frequent items.
{Diapers} → {Beer}
The purchase of one product when another product is
purchased represents an association rule.
Market Basket Analysis
One basket tells you about what
one customer purchased at one
time.
A loyalty card makes it possible
to tie together purchases by a
single customer (or household)
over time
Market Basket Analysis
Retail – each customer purchases a different set of products, in
different quantities, at different times
Retailers use this information to:
Identify who customers are (not by name)
Understand why they make certain purchases
Gain insight about their merchandise (products)
• Fast and slow movers
• Products which are purchased together
• Products which might benefit from promotion
Take action:
• Store layouts
• Which products to put on specials, promote, coupons…
Combining all of this with a customer loyalty card makes it
even more valuable
Market Basket Analysis
Association rules can be applied to other types of “baskets.”
Items purchased on a credit card, such as rental cars and hotel rooms,
provide insight into the next product that customers are likely to
purchase.
Optional services purchased by telecommunications customers (call
waiting, call forwarding, DSL, speed call, and so on) help determine
how to bundle these services together to maximize revenue.
Banking products used by retail customers (money market accounts,
certificates of deposit, investment services, car loans, and so on)
identify customers likely to want other products.
Unusual combinations of insurance claims can be a sign of fraud and
can spark further investigation.
Medical patient histories can give indications of likely complications
based on certain combinations of treatments.
What is Association Rule Mining
Given a set of transactions, find rules that will predict the
occurrence of an item based on the occurrences of other items
in the transaction
Market-Basket transactions:
TID  Items
1    Bread, Milk
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Beer, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke

Example of Association Rules:
{Diaper} → {Beer},
{Milk, Bread} → {Eggs, Coke},
{Beer, Bread} → {Milk}
How can Association rules be used?
What is Association Rule Mining
Rule form
Antecedent → Consequent [support, confidence]
(support and confidence are user defined measures of interestingness)
Let the rule discovered be {Bread, ...} → {Potato Chips}
Potato chips as consequent => Can be used to determine what
should be done to boost its sales
Bread in the antecedent => Can be used to see which
products would be affected if the store discontinues selling
bread
Bread in antecedent and Potato chips in the consequent =>
Can be used to see what products should be sold with Bread
to promote sale of Potato Chips
Association Rule Notation
Basic concepts
Given:
(1) database of transactions,
(2) each transaction is a list of items purchased by a
customer in a visit
Find:
all rules that correlate the presence of one set of items
(itemset) with that of another set of items
E.g., 35% of people who buy salmon also buy cheese
The model: data
I = {i1, i2, …, im}: a set of items
Transaction t:
t is a set of items, and t ⊆ I
Transaction Database T: a set of transactions
T = {t1, t2, …, tn}
Transaction data: Supermarket data
Market basket transactions:
t1: {bread, cheese, milk}
t2: {apple, eggs, salt, yogurt}
… …
tn: {biscuit, eggs, milk}
Concepts:
An item: an item/article in a basket
I: the set of all items sold in the store
A transaction: items purchased in a basket; it may have
TID (transaction ID)
A transactional dataset: A set of transactions
Definitions
Itemset
  A collection of one or more items
  Example: {Milk, Bread, Diaper}
k-itemset
  An itemset that contains k items
Support count (σ)
  Frequency of occurrence of an itemset
  E.g. σ({Milk, Bread, Diaper}) = 2
Frequent Itemset
  An itemset whose support is greater than or equal to a minsup threshold

TID  Items
1    Bread, Milk
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Beer, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke
Definition: Association Rule
Association Rule
  An implication expression of the form X → Y, where X and Y are itemsets
  Example: {Milk, Diaper} → {Beer}
Rule Evaluation Metrics
  Support (s)
    Fraction of transactions that contain both X and Y
  Confidence (c)
    The percentage of transactions containing X which also contain Y
    c = sup(X ∪ Y) / sup(X)

TID  Items
1    Bread, Milk
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Beer, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke

Example: {Milk, Diaper} → {Beer}
s = σ(Milk, Diaper, Beer) / |T| = 2/5 = 0.4
c = σ(Milk, Diaper, Beer) / σ(Milk, Diaper) = 2/3 ≈ 0.67
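As a concrete illustration, here is a minimal Python sketch (the transaction list and helper name are assumptions for illustration, not from the slides) that computes s and c for {Milk, Diaper} → {Beer} on the five example transactions:

```python
# Minimal sketch: support and confidence for {Milk, Diaper} -> {Beer}.
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def support_count(itemset):
    """Number of transactions containing every item of the itemset."""
    return sum(1 for t in transactions if itemset <= t)

X, Y = {"Milk", "Diaper"}, {"Beer"}
s = support_count(X | Y) / len(transactions)    # 2/5 = 0.4
c = support_count(X | Y) / support_count(X)     # 2/3 ≈ 0.67
print(s, round(c, 2))
```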
Example
Rule   Support Calculation   Confidence Calculation
a.     3/5 = 0.6             3/4 = 0.75
b.     3/5 = 0.6             3/3 = 1
c.     1/5 = 0.2             1/2 = 0.5
d.     1/5 = 0.2             1/3 = 0.33
e.     1/5 = 0.2             1/1 = 1
f.     0                     0
Why Support and Confidence
Support
is an important measure because a rule that has very low support may
occur simply by chance.
A low-support rule is also likely to be uninteresting from a business
perspective because it may not be profitable to promote items that
customers seldom buy together.
For these reasons, support is often used to eliminate uninteresting
rules.
Confidence
measures the reliability of the inference made by a rule.
For a given rule X → Y, the higher the confidence, the more likely it is
for Y to be present in transactions that contain X.
Confidence also provides an estimate of the conditional probability of
Y given X.
Association Rule Mining Problem
Given a set of transactions T, the goal of association rule
mining is to find all rules having
support ≥ minsup threshold
confidence ≥ minconf threshold
where minsup and minconf are the corresponding support and confidence
thresholds.
Brute-force approach:
List all possible association rules
Compute the support and confidence for each rule
Prune rules that fail the minsup and minconf thresholds
Computationally prohibitive!
Computational Complexity
Given d unique items:
Total number of itemsets = 2^d
Total number of possible association rules:
R = Σ_{k=1}^{d-1} [ C(d, k) × Σ_{j=1}^{d-k} C(d-k, j) ] = 3^d - 2^(d+1) + 1
If d = 6, R = 602 rules
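A quick sketch (variable names are assumptions) that checks the closed form against the double summation for d = 6:

```python
# Sketch: verify R = 3^d - 2^(d+1) + 1 against the double summation for d = 6.
from math import comb

d = 6
R_sum = sum(comb(d, k) * sum(comb(d - k, j) for j in range(1, d - k + 1))
            for k in range(1, d))
R_closed = 3 ** d - 2 ** (d + 1) + 1
print(R_sum, R_closed)  # both print 602
```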
Mining Association Rules
TID  Items
1    Bread, Milk
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Beer, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke

Example of Rules:
{Milk, Diaper} → {Beer}  (s=0.4, c=0.67)
{Milk, Beer} → {Diaper}  (s=0.4, c=1.0)
{Diaper, Beer} → {Milk}  (s=0.4, c=0.67)
{Beer} → {Milk, Diaper}  (s=0.4, c=0.67)
{Diaper} → {Milk, Beer}  (s=0.4, c=0.5)
{Milk} → {Diaper, Beer}  (s=0.4, c=0.5)
Observations:
• All the above rules are binary partitions of the same itemset:
{Milk, Diaper, Beer}
• Rules originating from the same itemset have identical support but
can have different confidence
• Thus, we may decouple the support and confidence requirements
Mining Association Rules
Two-step approach:
Frequent Itemset Generation
• Generate all itemsets whose support ≥ minsup.
• These itemsets are called frequent itemsets.
Rule Generation
• Generate high confidence rules from each frequent
itemset.
• These rules are called strong rules.
Frequent itemset generation is still computationally expensive.
Frequent Itemset Generation
Given d items, there are 2^d possible candidate itemsets, forming a lattice:
null
A B C D E
AB AC AD AE BC BD BE CD CE DE
ABC ABD ABE ACD ACE ADE BCD BCE BDE CDE
ABCD ABCE ABDE ACDE BCDE
ABCDE
Frequent Itemset Generation
Brute-force approach:
Each itemset in the lattice is a candidate frequent itemset
Count the support of each candidate by scanning the database
Each of the N transactions (of width up to w) is matched against the list of M candidates:

TID  Items
1    Bread, Milk
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Beer, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke

Match each transaction against every candidate.
If the candidate is contained in a transaction, its support count is incremented.
Complexity ~ O(NMw) => Expensive since M = 2^d !!!
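A brute-force counting sketch under these assumptions (the toy transaction list and names are mine, not from the slides): it enumerates all M = 2^d - 1 candidate itemsets and scans every transaction for each one.

```python
# Brute-force sketch: enumerate every candidate itemset in the lattice and
# count its support by scanning all N transactions.
from itertools import combinations

transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]
items = sorted(set().union(*transactions))        # d unique items

support = {}
for k in range(1, len(items) + 1):                # M = 2^d - 1 candidates overall
    for cand in combinations(items, k):
        c = frozenset(cand)
        support[c] = sum(1 for t in transactions if c <= t)   # one DB scan each

minsup = 2
frequent = {c for c, s in support.items() if s >= minsup}
print(len(support), len(frequent))                # 63 candidates for d = 6
```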
Frequent Itemset Generation Strategies
Reduce the number of candidates (M)
Complete search: M = 2^d
Use pruning techniques to reduce M
Reduce the number of transactions (N)
Reduce size of N as the size of itemset increases
Used by DHP and vertical-based mining algorithms
Reduce the number of comparisons (NM)
Use efficient data structures to store the candidates or
transactions
No need to match every candidate against every transaction
Reducing Number of Candidates
Apriori algorithm:
for finding frequent itemsets in a dataset
Name of the algorithm is Apriori because it uses prior
knowledge of frequent itemset properties.
We apply an iterative, level-wise search in which frequent k-itemsets
are used to find (k+1)-itemsets.
Reducing Number of Candidates
Apriori principle:
If an itemset is frequent, then all of its subsets must also be
frequent
If an itemset is infrequent, all its supersets will be infrequent.
A transaction containing {beer, diaper, nuts} also contains
{beer, diaper}.
If {beer, diaper, nuts} is frequent, then {beer, diaper} must also be
frequent.
Reducing Number of Candidates
Apriori principle:
If an itemset is frequent, then all of its subsets must also be
frequent
If an itemset is infrequent, all its supersets will be infrequent.
Apriori principle holds due to the following property of
the support measure:
∀ X, Y : (X ⊆ Y) ⇒ s(X) ≥ s(Y)
Support of an itemset never exceeds the support of its subsets
This is known as the anti-monotone property of support
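A quick numeric check of this property on the same toy transactions used earlier (a sketch; the data and names are assumptions):

```python
# Anti-monotone property: a superset is never more frequent than its subset.
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]
count = lambda s: sum(1 for t in transactions if s <= t)
X, Y = {"Milk", "Diaper"}, {"Milk", "Diaper", "Beer"}   # X ⊆ Y
assert count(Y) <= count(X)                             # 2 <= 3
```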
Illustrating Apriori Principle
If an itemset is infrequent, then all of its supersets must also be infrequent.

null
A  B  C  D  E
AB AC AD AE BC BD BE CD CE DE
ABC ABD ABE ACD ACE ADE BCD BCE BDE CDE
ABCD ABCE ABDE ACDE BCDE
ABCDE

(In the lattice figure, once an itemset is found to be infrequent, all of its supersets are pruned.)
Example
Consider the following dataset and we will find frequent
itemsets and generate association rules for them.
minimum support count is 2
minimum confidence is 60%
Example
Step-1: K=1
(I) Create a table containing the support count of each item
present in the dataset, called C1 (the candidate set).
(II) Compare each candidate item's support count with the
minimum support count. This gives us the itemset L1.
Example
Step-2: K=2
Generate candidate set C2 using L1 (this is called the join step).
The condition for joining Lk-1 with Lk-1 is that they should have
(K-2) elements in common.
Check whether all subsets of each candidate itemset are frequent
and, if not, remove that itemset. (For example, the subsets of
{I1, I2} are {I1} and {I2}, which are frequent. Check this for
each itemset.)
Now find the support count of these itemsets by searching the
dataset, as sketched in the code below.
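A minimal sketch of the join step described above (the item ordering and function name are assumptions, not from the slides): two frequent (k-1)-itemsets are joined only when their first k-2 items agree.

```python
# Join step sketch: build candidate k-itemsets from frequent (k-1)-itemsets.
def apriori_join(L_prev, k):
    """Candidate k-itemsets from the frequent (k-1)-itemsets L_prev."""
    sorted_sets = [sorted(s) for s in L_prev]
    candidates = set()
    for a in sorted_sets:
        for b in sorted_sets:
            # first k-2 items must agree; the last items are ordered to avoid duplicates
            if a[:k - 2] == b[:k - 2] and a[k - 2] < b[k - 2]:
                candidates.add(frozenset(a) | frozenset(b))
    return candidates

L1 = [{"I1"}, {"I2"}, {"I3"}, {"I4"}, {"I5"}]
print(apriori_join(L1, 2))   # every 2-item combination of the frequent items
```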
Example
(II) Compare candidate (C2) support counts with the minimum support count.
This gives us the itemset L2.
Example
Step-3:
Generate candidate set C3 using L2 (join step).
The condition for joining Lk-1 with Lk-1 is that they should
have (K-2) elements in common. So here, for L2, the first
element should match.
So the itemsets generated by joining L2 are {I1, I2, I3},
{I1, I2, I5}, {I1, I3, I5}, {I2, I3, I4}, {I2, I4, I5}, {I2, I3, I5}.
Check whether all subsets of these itemsets are frequent and,
if not, remove that itemset. (Here the subsets of {I1, I2, I3}
are {I1, I2}, {I2, I3}, {I1, I3}, which are frequent. For
{I2, I3, I4}, the subset {I3, I4} is not frequent, so remove it.
Similarly check every itemset.)
Find the support count of the remaining itemsets by searching
the dataset.
Example
(II) Compare candidate (C3) support counts with the minimum support count.
This gives us the itemset L3.
Step-4:
Generate candidate set C4 using L3 (join step). The condition for joining Lk-1
with Lk-1 (K=4) is that they should have (K-2) elements in common.
So here, for L3, the first 2 elements (items) should match.
Check whether all subsets of these itemsets are frequent (here the itemset
formed by joining L3 is {I1, I2, I3, I5}, whose subsets include {I1, I3, I5},
which is not frequent). So there is no itemset in C4.
We stop here because no further frequent itemsets are found.
Example
We have discovered all the frequent itemsets.
Now generation of strong association rules comes into the picture.
For that we need to calculate the confidence of each rule.
Confidence:
A confidence of 60% means that 60% of the customers who
purchased milk and bread also bought butter.
Confidence(A → B) = Support_count(A ∪ B) / Support_count(A)
Example
Itemset {I1, I2, I3} // from L3
So the rules can be:
[I1^I2] => [I3]  // confidence = sup(I1^I2^I3)/sup(I1^I2) = 2/4 × 100 = 50%
[I1^I3] => [I2]  // confidence = sup(I1^I2^I3)/sup(I1^I3) = 2/4 × 100 = 50%
[I2^I3] => [I1]  // confidence = sup(I1^I2^I3)/sup(I2^I3) = 2/4 × 100 = 50%
[I1] => [I2^I3]  // confidence = sup(I1^I2^I3)/sup(I1) = 2/6 × 100 = 33%
[I2] => [I1^I3]  // confidence = sup(I1^I2^I3)/sup(I2) = 2/7 × 100 = 28%
[I3] => [I1^I2]  // confidence = sup(I1^I2^I3)/sup(I3) = 2/6 × 100 = 33%
So if the minimum confidence is 50%, then the first 3 rules can be considered
strong association rules.
Illustrating Apriori Principle
TID  Items
1    Bread, Milk
2    Beer, Bread, Diaper, Eggs
3    Beer, Coke, Diaper, Milk
4    Beer, Bread, Diaper, Milk
5    Bread, Coke, Diaper, Milk

Items (1-itemsets):
Item    Count
Bread   4
Coke    2
Milk    4
Beer    3
Diaper  4
Eggs    1

Minimum Support = 2
If every subset is considered,
C(6,1) + C(6,2) + C(6,3) = 6 + 15 + 20 = 41 candidates
With support-based pruning,
6 + 6 + 4 = 16 candidates
Illustrating Apriori Principle
Items (1-itemsets):
Item    Count
Bread   4
Coke    2
Milk    4
Beer    3
Diaper  4
Eggs    1

Pairs (2-itemsets) (no need to generate candidates involving Coke or Eggs):
{Bread, Milk}
{Bread, Beer}
{Bread, Diaper}
{Beer, Milk}
{Diaper, Milk}
{Beer, Diaper}

Minimum Support = 2
If every subset is considered,
C(6,1) + C(6,2) + C(6,3) = 6 + 15 + 20 = 41 candidates
With support-based pruning,
6 + 6 + 4 = 16 candidates
Illustrating Apriori Principle
Items (1-itemsets):
Item    Count
Bread   4
Coke    2
Milk    4
Beer    3
Diaper  4
Eggs    1

Pairs (2-itemsets) (no need to generate candidates involving Coke or Eggs):
Itemset          Count
{Bread, Milk}    3
{Beer, Bread}    2
{Bread, Diaper}  3
{Beer, Milk}     2
{Diaper, Milk}   3
{Beer, Diaper}   3

Minimum Support = 2
If every subset is considered,
C(6,1) + C(6,2) + C(6,3) = 6 + 15 + 20 = 41 candidates
With support-based pruning,
6 + 6 + 4 = 16 candidates
Illustrating Apriori Principle
Items (1-itemsets):
Item    Count
Bread   4
Coke    2
Milk    4
Beer    3
Diaper  4
Eggs    1

Pairs (2-itemsets) (no need to generate candidates involving Coke or Eggs):
Itemset          Count
{Bread, Milk}    3
{Bread, Beer}    2
{Bread, Diaper}  3
{Milk, Beer}     2
{Milk, Diaper}   3
{Beer, Diaper}   3

Triplets (3-itemsets):
{Beer, Diaper, Milk}
{Beer, Bread, Diaper}
{Bread, Diaper, Milk}
{Beer, Bread, Milk}

Minimum Support = 2
If every subset is considered,
C(6,1) + C(6,2) + C(6,3) = 6 + 15 + 20 = 41 candidates
With support-based pruning,
6 + 6 + 4 = 16 candidates
Illustrating Apriori Principle
Items (1-itemsets):
Item    Count
Bread   4
Coke    2
Milk    4
Beer    3
Diaper  4
Eggs    1

Pairs (2-itemsets) (no need to generate candidates involving Coke or Eggs):
Itemset          Count
{Bread, Milk}    3
{Bread, Beer}    2
{Bread, Diaper}  3
{Milk, Beer}     2
{Milk, Diaper}   3
{Beer, Diaper}   3

Triplets (3-itemsets):
Itemset                Count
{Beer, Diaper, Milk}   2
{Beer, Bread, Diaper}  2
{Bread, Diaper, Milk}  2
{Beer, Bread, Milk}    1

Minimum Support = 2
If every subset is considered,
C(6,1) + C(6,2) + C(6,3) = 6 + 15 + 20 = 41 candidates
With support-based pruning,
6 + 6 + 4 = 16
6 + 6 + 1 = 13
Apriori Algorithm
Method:
Let k=1
Generate frequent itemsets of length 1
Repeat until no new frequent itemsets are identified
• Generate length (k+1) candidate itemsets from length k
frequent itemsets
• Prune candidate itemsets containing subsets of length k
that are infrequent
• Count the support of each candidate by scanning the DB
• Eliminate candidates that are infrequent, leaving only
those that are frequent
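Putting these steps together, here is a compact level-wise Apriori sketch in Python (function and variable names are assumptions; an illustrative sketch of the method above, not a reference implementation):

```python
# Level-wise Apriori sketch: generate, prune, count, and keep frequent itemsets.
from itertools import combinations

def apriori(transactions, minsup):
    """Return {frozenset(itemset): support_count} for all frequent itemsets."""
    # k = 1: count single items and keep the frequent ones
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    frequent = {s: c for s, c in counts.items() if c >= minsup}
    result = dict(frequent)

    k = 2
    while frequent:
        prev = list(frequent)
        # generate length-k candidates from length-(k-1) frequent itemsets,
        # pruning any candidate that has an infrequent (k-1)-subset
        candidates = set()
        for i in range(len(prev)):
            for j in range(i + 1, len(prev)):
                union = prev[i] | prev[j]
                if len(union) == k and all(
                        frozenset(sub) in frequent
                        for sub in combinations(union, k - 1)):
                    candidates.add(union)
        # count support of each candidate by scanning the database
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        frequent = {c: s for c, s in counts.items() if s >= minsup}
        result.update(frequent)
        k += 1
    return result
```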
Generating AR from frequent itemsets
Confidence
For every frequent itemset x, generate all non-empty proper
subsets of x
For every non-empty proper subset A of x, output the rule
A → (x - A) if its confidence, sup(x)/sup(A), is at least minconf
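A sketch of this rule-generation step, assuming a support dictionary like the one returned by the apriori() sketch above (names are assumptions):

```python
# Rule generation sketch: split each frequent itemset into antecedent/consequent
# and keep rules whose confidence meets minconf.
from itertools import combinations

def generate_rules(support, minconf):
    rules = []
    for itemset, sup_xy in support.items():
        if len(itemset) < 2:
            continue
        for r in range(1, len(itemset)):
            for antecedent in combinations(itemset, r):
                A = frozenset(antecedent)
                conf = sup_xy / support[A]        # sup(x) / sup(A)
                if conf >= minconf:
                    rules.append((set(A), set(itemset - A), conf))
    return rules
```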
The Apriori Algorithm — Example
The Apriori Algorithm — Example
(Contd.)
Frequent itemset = {2, 3, 5}
Rules are:
Association Rule   Confidence    Confidence %
2^3 → 5            2/2 = 1       100%
2^5 → 3            2/3 ≈ 0.67    67%
3^5 → 2            2/2 = 1       100%
5 → 2^3            2/3 ≈ 0.67    67%
3 → 2^5            2/3 ≈ 0.67    67%
2 → 3^5            2/3 ≈ 0.67    67%
If the minimum confidence threshold is 70%, then the
only strong rules are: 2^3 → 5 and 3^5 → 2
The Apriori Algorithm—An Example
Supmin = 2

Database TDB:
Tid  Items
10   A, C, D
20   B, C, E
30   A, B, C, E
40   B, E

1st scan, C1:
Itemset  sup
{A}      2
{B}      3
{C}      3
{D}      1
{E}      3

L1:
Itemset  sup
{A}      2
{B}      3
{C}      3
{E}      3

C2 (generated from L1): {A, B}, {A, C}, {A, E}, {B, C}, {B, E}, {C, E}

2nd scan, C2:
Itemset  sup
{A, B}   1
{A, C}   2
{A, E}   1
{B, C}   2
{B, E}   3
{C, E}   2

L2:
Itemset  sup
{A, C}   2
{B, C}   2
{B, E}   3
{C, E}   2

C3 (generated from L2): {B, C, E}

3rd scan, L3:
Itemset    sup
{B, C, E}  2
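For a quick check, running the earlier apriori() sketch on this TDB (an assumed encoding of the table above) reproduces L3 = {B, C, E} with support 2:

```python
# Reuse the apriori() sketch defined earlier on the TDB example (minsup = 2).
tdb = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
frequent = apriori(tdb, minsup=2)
print({tuple(sorted(s)): c for s, c in frequent.items() if len(s) == 3})
# {('B', 'C', 'E'): 2}
```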
Is Apriori Fast Enough? —
Performance Bottlenecks
The core of the Apriori algorithm:
Use frequent (k – 1)-itemsets to generate candidate frequent
k-itemsets
Use database scan and pattern matching to collect counts for
the candidate itemsets
The bottleneck of Apriori: Candidate generation
Huge candidate sets
Multiple scans of database
Problems with association mining
Rare Item Problem: It assumes that all items in the data are
of the same nature and/or have similar frequencies.
Not true: In many applications, some items appear very
frequently in the data, while others rarely appear.
E.g., in a supermarket, people buy food processors and
cooking pans much less frequently than they buy bread and
milk.
Interestingness Measurements
How good is the association rule?
Are all of the strong association rules discovered
interesting enough to present to the user?
How can we measure the interestingness of a rule?
Subjective measures
A rule (pattern) is interesting if
• it is unexpected (surprising to the user); and/or
• actionable (the user can do something with it)
• (only the user can judge the interestingness of a rule)
Apriori Advantages &
Disadvantages
Advantages:
Uses large itemset property.
Easily parallelized
Easy to implement.
Disadvantages:
Assumes transaction database is memory resident.
Requires up to m database scans.
Thank You