Data Science and Big Data Analytics
Chapter 5: Advanced Analytical Theory and
Methods: Association Rules
Chapter Sections
5.1 Overview
5.2 Apriori Algorithm
5.3 Evaluation of Candidate Rules
5.4 Applications of Association Rules
5.5 Example: Transactions in a Grocery Store
5.6 Validation and Testing
5.7 Diagnostics
5.1 Overview
Association rules method
Unsupervised learning method
Descriptive (not predictive) method
Used to find hidden relationships in data
The relationships are represented as rules
Questions association rules might answer
Which products tend to be purchased together
What products do similar customers tend to buy
5.1 Overview
Example – general logic of association rules
5.1 Overview
Rules have the form X -> Y
When X is observed, Y is also observed
Itemset
Collection of items or entities
k-itemset = {item 1, item 2,…,item k}
Examples
Items purchased in one transaction
Set of hyperlinks clicked by a user in one session
5.1 Overview – Apriori Algorithm
Apriori is the most fundamental algorithm for mining frequent itemsets
Given itemset L, support of L is the percent of
transactions that contain L
Frequent itemset – items appear together “often
enough”
Minimum support defines “often enough” (% transactions)
If an itemset is frequent, then any subset is frequent
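The support definition above can be sketched in a few lines of Python. This is an illustrative example with a made-up transaction database; the function and item names are mine, not from the book:

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = frozenset(itemset)
    hits = sum(1 for t in transactions if itemset <= set(t))
    return hits / len(transactions)

# Toy transaction database (hypothetical items)
transactions = [
    {"milk", "bread", "eggs"},
    {"milk", "bread"},
    {"milk", "eggs"},
    {"bread", "butter"},
]

print(support({"milk"}, transactions))           # 0.75
print(support({"milk", "bread"}, transactions))  # 0.5
```

With a minimum support of, say, 0.6, {milk} would be frequent but {milk, bread} would not.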
5.1 Overview – Apriori Algorithm
If {B,C,D} frequent, then all subsets frequent
5.2 Apriori Algorithm
Frequent = meets the minimum support threshold
Bottom-up iterative algorithm
Identify the frequent (min support) 1-itemsets
Frequent 1-itemsets are paired into 2-itemsets,
and the frequent 2-itemsets are identified, etc.
Definitions for next slide
D = transaction database
d = minimum support threshold
N = maximum length of itemset (optional parameter)
Ck = set of candidate k-itemsets
Lk = set of k-itemsets with minimum support
5.2 Apriori Algorithm
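Using the definitions above (D, d, Ck, Lk), the bottom-up iteration can be sketched in Python. This is a minimal, unoptimized illustration under my own naming, not the book's implementation:

```python
from itertools import combinations

def apriori(D, d):
    """Bottom-up Apriori sketch.
    D = transaction database (list of sets), d = minimum support (fraction)."""
    n = len(D)

    def sup(itemset):
        # Support: fraction of transactions containing the itemset
        return sum(1 for t in D if itemset <= t) / n

    # L1: frequent 1-itemsets
    Lk = {frozenset([i]) for t in D for i in t if sup(frozenset([i])) >= d}
    frequent = set(Lk)
    k = 1
    while Lk:
        k += 1
        # Ck: candidate k-itemsets from unions of frequent (k-1)-itemsets
        Ck = {a | b for a in Lk for b in Lk if len(a | b) == k}
        # Apriori property: keep candidates whose (k-1)-subsets are all
        # frequent, then check minimum support
        Lk = {c for c in Ck
              if all(frozenset(s) in frequent for s in combinations(c, k - 1))
              and sup(c) >= d}
        frequent |= Lk
    return frequent

# Toy database (hypothetical items)
D = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}, {"a", "b", "c"}]
result = apriori(D, d=0.6)
# {a}, {b}, {c}, {a,b}, {a,c}, {b,c} are frequent; {a,b,c} (support 0.4) is not
```

The subset check is where the Apriori property pays off: candidates with any infrequent subset are pruned before their support is ever counted.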
5.3 Evaluation of Candidate Rules
Confidence
Frequent itemsets can form candidate rules
Confidence measures the certainty of a rule
Minimum confidence – predefined threshold
Problem with confidence
Given a rule X->Y, confidence considers only the
antecedent (X) and the co-occurrence of X and Y
It ignores how common Y is on its own, so it cannot tell
whether the rule reflects a true implication or merely a
popular consequent
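Confidence of a rule X -> Y is support(X ∪ Y) / support(X). A minimal sketch with toy data and my own helper names:

```python
def support(itemset, transactions):
    # Fraction of transactions containing every item in the itemset
    itemset = frozenset(itemset)
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(X, Y, transactions):
    # Confidence(X -> Y) = support(X ∪ Y) / support(X)
    X, Y = frozenset(X), frozenset(Y)
    return support(X | Y, transactions) / support(X, transactions)

# Toy transaction database (hypothetical items)
transactions = [
    {"milk", "bread"},
    {"milk", "bread"},
    {"milk", "eggs"},
    {"bread"},
]
# {milk} in 3 of 4 transactions; {milk, bread} in 2 of 4
print(round(confidence({"milk"}, {"bread"}, transactions), 4))  # 0.6667
```

Note that confidence({milk} -> {bread}) never looks at how often bread sells on its own, which is exactly the weakness described above.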
5.3 Evaluation of Candidate Rules
Lift
Lift measures how much more often X and Y
occur together than expected if statistically
independent
Lift = 1 if X and Y are statistically independent
Lift > 1 indicates the degree of usefulness of the rule
Example – in 1000 transactions,
If {milk, eggs} appears in 300, {milk} in 500, and {eggs} in
400, then Lift(milk->eggs) = 0.3/(0.5*0.4) = 1.5
If {milk, bread} appears in 400, {milk} in 500, and {bread}
in 400, then Lift(milk->bread) = 0.4/(0.5*0.4) = 2.0
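The slide's lift numbers can be checked directly from the formula Lift(X->Y) = support(X ∪ Y) / (support(X) * support(Y)):

```python
def lift(sup_xy, sup_x, sup_y):
    # Lift(X -> Y) = support(X ∪ Y) / (support(X) * support(Y))
    return sup_xy / (sup_x * sup_y)

# Figures from the slide: 1000 transactions
print(round(lift(0.3, 0.5, 0.4), 2))  # 1.5  (milk -> eggs)
print(round(lift(0.4, 0.5, 0.4), 2))  # 2.0  (milk -> bread)
```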
5.3 Evaluation of Candidate Rules
Leverage
Leverage measures the difference in the
probability of X and Y appearing together
compared to statistical independence
Leverage = 0 if X and Y are statistically independent
Leverage > 0 indicates degree of usefulness of rule
Example – in 1000 transactions,
If {milk, eggs} appears in 300, {milk} in 500, and {eggs} in
400, then Leverage(milk->eggs) = 0.3 - 0.5*0.4 = 0.1
If {milk, bread} appears in 400, {milk} in 500, and {bread}
in 400, then Leverage (milk->bread) = 0.4 - 0.5*0.4 = 0.2
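Likewise, the leverage numbers follow from Leverage(X->Y) = support(X ∪ Y) - support(X) * support(Y):

```python
def leverage(sup_xy, sup_x, sup_y):
    # Leverage(X -> Y) = support(X ∪ Y) - support(X) * support(Y)
    return sup_xy - sup_x * sup_y

# Figures from the slide: 1000 transactions
print(round(leverage(0.3, 0.5, 0.4), 2))  # 0.1  (milk -> eggs)
print(round(leverage(0.4, 0.5, 0.4), 2))  # 0.2  (milk -> bread)
```

Both rules have positive leverage, but leverage ranks milk -> bread higher, agreeing with lift on this data.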
5.4 Applications of Association Rules
The term market basket analysis refers to a
specific implementation of association rules
For better merchandising – products to
include/exclude from inventory each month
Placement of products within related products
Association rules also used for
Recommender systems – Amazon, Netflix
Clickstream analysis from web usage log files
Website visitors to page X click on links A,B,C more than on
links D,E,F
5.6 Validation and Testing
Frequent itemsets and high-confidence rules are found using
pre-specified minimum support and minimum confidence levels
Measures like lift and leverage then help ensure that
interesting rules are identified rather than coincidental ones
However, some of the remaining rules may be considered
subjectively uninteresting because they don’t yield
unexpected profitable actions
E.g., a rule like {paper} -> {pencil} is too obvious to be interesting or actionable
Incorporating subjective knowledge requires domain experts
Good rules provide valuable insights for institutions to
improve their business operations
5.7 Diagnostics
Although minimum support is pre-specified in phases 3 and 4
(model planning and model building), the level can be adjusted
to target a desired number of rules – variants and
improvements of Apriori are available
For large datasets the Apriori algorithm can be
computationally expensive – common efficiency improvements:
Partitioning
Sampling
Transaction reduction
Hash-based itemset counting
Dynamic itemset counting
arules in R
https://rpubs.com/emzak208/281776
https://rpubs.com/aru0511/GroceriesDatasetAssociationAnalysis