W5 - Apriori

Uploaded by Kim Nguyen


Data Science and Big Data Analytics

Chapter 5: Advanced Analytical Theory and Methods: Association Rules
Chapter Sections

 5.1 Overview
 5.2 Apriori Algorithm
 5.3 Evaluation of Candidate Rules
 5.4 Applications of Association Rules
 5.5 An Example: Transactions in a Grocery Store
 5.6 Validation and Testing
 5.7 Diagnostics
5.1 Overview
 Association rules method
 Unsupervised learning method
 Descriptive (not predictive) method
 Used to find hidden relationships in data
 The relationships are represented as rules
 Questions association rules might answer
 Which products tend to be purchased together?
 What products do similar customers tend to buy?
5.1 Overview
 Example – general logic of association rules
5.1 Overview
 Rules have the form X -> Y
 When X is observed, Y also tends to be observed
 Itemset
 Collection of items or entities
 k-itemset = {item 1, item 2,…,item k}
 Examples
 Items purchased in one transaction
 Set of hyperlinks clicked by a user in one session
5.1 Overview – Apriori Algorithm
 Apriori is the most fundamental association rules algorithm
 Given an itemset L, Support(L) is the percentage of transactions that contain L
 A frequent itemset is one whose items appear together "often enough"
 Minimum support defines "often enough" (as a percentage of transactions)
 If an itemset is frequent, then any subset of it is also frequent (the Apriori property)
5.1 Overview – Apriori Algorithm
 If {B,C,D} frequent, then all subsets frequent
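The support measure and the Apriori (downward-closure) property above can be illustrated with a short Python sketch; the transaction database here is made up for illustration:

```python
from itertools import combinations

# Toy transaction database (each transaction is a set of items)
transactions = [
    {"B", "C", "D"},
    {"A", "B", "C"},
    {"B", "C", "D"},
    {"A", "D"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

# Apriori property: support never increases as an itemset grows, so
# every subset of a frequent itemset is at least as frequent.
s_bcd = support({"B", "C", "D"}, transactions)
for sub in combinations({"B", "C", "D"}, 2):
    assert support(sub, transactions) >= s_bcd
```

Because support can only shrink as an itemset grows, Apriori can safely prune any candidate that contains an infrequent subset.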
5.2 Apriori Algorithm
Frequent = meets the minimum support threshold
 Bottom-up iterative algorithm
 Identify the frequent (min support) 1-itemsets
 Frequent 1-itemsets are paired into 2-itemsets,
and the frequent 2-itemsets are identified, etc.
 Definitions for next slide
 D = transaction database
 d = minimum support threshold
 N = maximum length of itemset (optional parameter)
 Ck = set of candidate k-itemsets
 Lk = set of k-itemsets with minimum support
5.2 Apriori Algorithm
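A minimal Python sketch of the bottom-up procedure, using the definitions above (D, d, N, Ck, Lk). This is an illustration of the idea, not an optimized implementation:

```python
from itertools import combinations

def apriori(D, d, N=None):
    """Bottom-up Apriori sketch.
    D = transaction database (list of sets), d = minimum support,
    N = optional maximum itemset length."""
    def support(itemset):
        return sum(itemset <= t for t in D) / len(D)

    # L1: frequent 1-itemsets
    items = sorted({i for t in D for i in t})
    L = {1: [frozenset([i]) for i in items if support(frozenset([i])) >= d]}
    k = 1
    while L[k] and (N is None or k < N):
        k += 1
        prev = L[k - 1]
        # Ck: candidate k-itemsets joined from frequent (k-1)-itemsets
        Ck = {a | b for a in prev for b in prev if len(a | b) == k}
        # Prune candidates with any infrequent (k-1)-subset, then
        # keep only candidates meeting minimum support
        L[k] = [c for c in Ck
                if all(frozenset(s) in prev for s in combinations(c, k - 1))
                and support(c) >= d]
    return L
```

For example, with `D = [{"B","C","D"}, {"A","B","C"}, {"B","C","D"}, {"A","D"}]` and `d = 0.5`, the frequent 2-itemsets are {B,C}, {B,D}, {C,D}, and {B,C,D} is the only frequent 3-itemset.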
5.3 Evaluation of Candidate Rules
Confidence
 Frequent itemsets can form candidate rules
 Confidence measures the certainty of a rule:
Confidence(X -> Y) = Support(X ∧ Y) / Support(X)
 Minimum confidence – only rules whose confidence meets a predefined threshold are kept
 Problem with confidence
 Given a rule X->Y, confidence considers only the
antecedent (X) and the co-occurrence of X and Y
 Cannot tell if a rule contains true implication
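The problem above can be made concrete: a rule can score perfect confidence without any real implication. A small Python illustration with made-up transactions:

```python
def confidence(X, Y, transactions):
    """Confidence(X -> Y) = Support(X and Y) / Support(X)."""
    X, Y = set(X), set(Y)
    n_x = sum(X <= t for t in transactions)
    n_xy = sum((X | Y) <= t for t in transactions)
    return n_xy / n_x

# If Y appears in every transaction, then X -> Y has confidence 1.0
# for any X, even though X tells us nothing about Y.
transactions = [
    {"milk", "bread"}, {"milk", "bread"}, {"milk", "bread"},
    {"eggs", "bread"}, {"tea", "bread"},
]
print(confidence({"tea"}, {"bread"}, transactions))  # 1.0
```

This is why confidence alone cannot distinguish a genuine association from mere co-occurrence with a very common item; lift and leverage (next slides) address this.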
5.3 Evaluation of Candidate Rules
Lift
 Lift measures how many times more often X and Y occur together than expected if they were statistically independent:
Lift(X -> Y) = Support(X ∧ Y) / (Support(X) * Support(Y))
 Lift = 1 if X and Y are statistically independent
 Lift > 1 indicates the degree of usefulness of the rule
 Example – in 1000 transactions,
 If {milk, eggs} appears in 300, {milk} in 500, and {eggs} in
400, then Lift(milk->eggs) = 0.3/(0.5*0.4) = 1.5
 If {milk, bread} appears in 400, {milk} in 500, and {bread}
in 400, then Lift(milk->bread) = 0.4/(0.5*0.4) = 2.0
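The two worked examples can be checked with a one-line helper (illustrative Python, using the slide's counts out of 1000 transactions):

```python
def lift(s_xy, s_x, s_y):
    """Lift(X -> Y) = Support(X and Y) / (Support(X) * Support(Y))."""
    return s_xy / (s_x * s_y)

print(round(lift(300 / 1000, 500 / 1000, 400 / 1000), 3))  # milk -> eggs : 1.5
print(round(lift(400 / 1000, 500 / 1000, 400 / 1000), 3))  # milk -> bread: 2.0
```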
5.3 Evaluation of Candidate Rules
Leverage
 Leverage measures the difference between the observed frequency of X and Y appearing together and the frequency expected if they were statistically independent:
Leverage(X -> Y) = Support(X ∧ Y) - Support(X) * Support(Y)
 Leverage = 0 if X and Y are statistically independent
 Leverage > 0 indicates degree of usefulness of rule
 Example – in 1000 transactions,
 If {milk, eggs} appears in 300, {milk} in 500, and {eggs} in
400, then Leverage(milk->eggs) = 0.3 - 0.5*0.4 = 0.1
 If {milk, bread} appears in 400, {milk} in 500, and {bread}
in 400, then Leverage(milk->bread) = 0.4 - 0.5*0.4 = 0.2
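Likewise for leverage (illustrative Python, using the slide's counts):

```python
def leverage(s_xy, s_x, s_y):
    """Leverage(X -> Y) = Support(X and Y) - Support(X) * Support(Y)."""
    return s_xy - s_x * s_y

print(round(leverage(0.3, 0.5, 0.4), 3))  # milk -> eggs : 0.1
print(round(leverage(0.4, 0.5, 0.4), 3))  # milk -> bread: 0.2
```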
5.4 Applications of Association Rules
 The term market basket analysis refers to a specific implementation of association rules
 For better merchandising – products to
include/exclude from inventory each month
 Placement of products within related products
 Association rules also used for
 Recommender systems – Amazon, Netflix
 Clickstream analysis from web usage log files
 Website visitors to page X click on links A,B,C more than on
links D,E,F
5.6 Validation and Testing
 Frequent itemsets and high-confidence rules are identified using pre-specified minimum support and minimum confidence levels
 Measures such as lift and leverage then help ensure that interesting rules are identified rather than coincidental ones
 However, some of the remaining rules may be considered
subjectively uninteresting because they don’t yield
unexpected profitable actions
 E.g., rules like {paper} -> {pencil} are not interesting/meaningful
 Incorporating subjective knowledge requires domain experts
 Good rules provide valuable insights for institutions to
improve their business operations
5.7 Diagnostics
 Although minimum support is pre-specified in phases 3 and 4 (model planning and building), this level can be adjusted to target a desired range for the number of rules – variants and improvements of Apriori are available
 For large datasets the Apriori algorithm can be
computationally expensive – efficiency improvements
 Partitioning
 Sampling
 Transaction reduction
 Hash-based itemset counting
 Dynamic itemset counting
arules in R
 https://rpubs.com/emzak208/281776
 https://rpubs.com/aru0511/GroceriesDatasetAssociationAnalysis
