Data Science with R
Lesson 13—Association
© Copyright 2015, Simplilearn. All rights reserved.
Objectives
After completing this lesson, you will be able to:
• Explain association rule mining and parameters of interesting relationships
• Explain the Apriori algorithm and steps to find frequent item sets
Topic 1: Association Rule Mining
Association Rules
An association rule is a pattern that states when X occurs, Y occurs with a certain probability. A
transaction t contains X, a set of items (item set) in I, if X is a subset of t.
An association rule is an implication of the form:
X ➞ Y
where X, Y ⊆ I and X ∩ Y = ∅
Association Rule Mining
This is a classical Data Mining technique that:
• Finds out interesting patterns in a dataset
• Assumes all data elements as categorical
• Is not suitable for numeric data
Brute-force solutions cannot find all interesting combinations of items in reasonable time and computing power, because the number of possible item sets grows exponentially with the number of items.
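The blow-up is easy to quantify: n distinct items admit 2^n − 1 non-empty item sets. A small Python sketch (Python is used here purely for illustration):

```python
from itertools import combinations

def count_itemsets(items):
    """Count all non-empty item sets by brute-force enumeration."""
    return sum(1 for k in range(1, len(items) + 1)
               for _ in combinations(items, k))

# The count matches the closed form 2^n - 1 for small n...
for n in (3, 5, 10):
    assert count_itemsets(range(n)) == 2**n - 1

# ...and is already astronomical for a modest 100-item inventory:
print(2**100 - 1)
```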
Application Areas of Association Rule Mining
Some examples are:
Market Basket Data Analysis
Purchase Data Analysis
Website Traffic Analysis
Parameters of Interesting Relationships
Interesting relationships have two parameters:
• Frequent item sets: Collection of items occurring together frequently
• Association rules: Indicators of a strong relationship between two items
Example:
In the “Items” table below, {wine, diapers, soy milk} is a frequent item set,
and diapers ➞ wine is an association rule:
Association Rule Strength Measures
The measures of the strength of association rules are explained below:
Support
For an item set, it is the percentage of the dataset that contains this item set.
The rule holds with support sup in T if sup% of transactions contain X ∪ Y.
sup = Pr(X ∪ Y)
Example: In the “Items” table, the support of {soy milk} is 4/5 and of {soy milk, diapers} is 3/5.

Confidence
The confidence for the rule {diapers} ➞ {wine} is defined as support({diapers, wine})/support({diapers}).
The rule holds in T with confidence conf if conf% of transactions that contain X also contain Y.
conf = Pr(Y | X)
Example: In the “Items” table, the confidence for diapers ➞ wine is (3/5)/(4/5) = 3/4 = 0.75.
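Both measures are easy to compute directly. Since the “Items” table itself is not reproduced here, the Python sketch below uses a hypothetical five-transaction dataset chosen to be consistent with the figures above (the transaction contents are an assumption for illustration):

```python
# Hypothetical five-transaction dataset consistent with the figures above.
transactions = [
    {"soy milk", "lettuce"},
    {"lettuce", "diapers", "wine", "beets"},
    {"soy milk", "diapers", "wine", "orange juice"},
    {"lettuce", "soy milk", "diapers", "wine"},
    {"lettuce", "soy milk", "diapers", "ice cream"},
]

def support(itemset):
    """sup = Pr(X ∪ Y): fraction of transactions containing every item."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(lhs, rhs):
    """conf = Pr(Y | X) = support(X ∪ Y) / support(X)."""
    return support(set(lhs) | set(rhs)) / support(lhs)

print(support({"soy milk"}))              # 4/5
print(support({"soy milk", "diapers"}))   # 3/5
print(confidence({"diapers"}, {"wine"}))  # ≈ 0.75
```

In an R workflow the arules package provides these measures; the sketch above just makes the definitions concrete.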
Limitations of Support and Confidence
While support and confidence can help you quantify the success of
association analysis, for thousands of sale items the process of computing
them for every possible item set can be very slow.
In such cases, you can use algorithms such as Apriori.
Topic 2: Apriori Algorithm
Apriori Algorithm: Meaning
This algorithm:
• Helps reduce the number of possible interesting item sets
• Assumes that if an item set is frequent, all of its subsets are also frequent; equivalently, if an item set is infrequent, all of its supersets are infrequent and can be pruned
[Figure: all possible item sets from the set {1, 2, 3}, with infrequent item sets highlighted]
Apriori Algorithm: Example
To understand its application, consider the “Shopping Baskets” transaction set below, which ignores some
important parameters, such as item quantities and prices paid:
t1: Beef, Chicken, Milk
t2: Beef, Cheese
t3: Cheese, Boots
t4: Beef, Chicken, Cheese
t5: Beef, Chicken, Clothes, Cheese, Milk
t6: Chicken, Clothes, Milk
t7: Chicken, Milk, Clothes
Applying Apriori Algorithm: Steps
It includes two steps:
Mine all frequent item sets
Generate rules from frequent item sets
Assume:
• minsup = 30%
• minconf = 80%
An example frequent item set:
{Chicken, Clothes, Milk} [sup = 3/7]
Association rules from the item set:
Clothes ➞ Milk, Chicken [sup = 3/7, conf = 3/3]
… …
Clothes, Chicken ➞ Milk [sup = 3/7, conf = 3/3]
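These figures can be verified directly against the seven baskets; a minimal Python sketch:

```python
# The seven "Shopping Baskets" transactions from the example.
baskets = [
    {"Beef", "Chicken", "Milk"},
    {"Beef", "Cheese"},
    {"Cheese", "Boots"},
    {"Beef", "Chicken", "Cheese"},
    {"Beef", "Chicken", "Clothes", "Cheese", "Milk"},
    {"Chicken", "Clothes", "Milk"},
    {"Chicken", "Milk", "Clothes"},
]

def support_count(itemset):
    """Number of baskets containing every item in the item set."""
    return sum(set(itemset) <= t for t in baskets)

# {Chicken, Clothes, Milk} is frequent at minsup = 30%: sup = 3/7
assert support_count({"Chicken", "Clothes", "Milk"}) == 3

# Rule Clothes ➞ Milk, Chicken: conf = 3/3, above minconf = 80%
conf = support_count({"Chicken", "Clothes", "Milk"}) / support_count({"Clothes"})
print(conf)  # 1.0
```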
Step 1: Mine All Frequent Item Sets
A frequent item set is:
• The one with sup ≥ minsup
• Any subset of a frequent item set
Algorithm to Find Frequent Item Set
Also called level-wise search, it includes the following steps:
Find all 1-item frequent item sets, then all 2-item frequent item sets, and so on
In each iteration k, consider only item sets that contain some frequent (k-1)-item set
Find frequent item sets of size 1: F1
For each k ≥ 2: Ck = candidate item sets of size k that could be frequent given Fk-1, and Fk = those candidates that are actually frequent (Fk ⊆ Ck)
Finding Frequent Item Set—Example
Consider the below dataset T with minsup = 0.5:
TID Items
T100 1, 3, 4
T200 2, 3, 5
T300 1, 2, 3, 5
T400 2, 5
item set : count
1. scan T → C1: {1}:2, {2}:3, {3}:3, {4}:1, {5}:3
   → F1: {1}:2, {2}:3, {3}:3, {5}:3
   → C2: {1,2}, {1,3}, {1,5}, {2,3}, {2,5}, {3,5}
2. scan T → C2: {1,2}:1, {1,3}:2, {1,5}:1, {2,3}:2, {2,5}:3, {3,5}:2
   → F2: {1,3}:2, {2,3}:2, {2,5}:3, {3,5}:2
   → C3: {2,3,5}
3. scan T → C3: {2,3,5}:2 → F3: {2,3,5}
Ordering Items
The items in I are sorted in lexicographic order (a total order).
• This order is applied within every item set and is used throughout the algorithm.
• {w[1], w[2], …, w[k]} represents a k-item set w consisting of items w[1], w[2], …, w[k], where
w[1] < w[2] < … < w[k].
Ordering Items (contd.)
Using this ordering, the main Apriori algorithm is:
C1 ← init-pass(T);
F1 ← {f | f ∈ C1, f.count/n ≥ minsup};   // n: no. of transactions in T
for (k = 2; Fk-1 ≠ ∅; k++) do
  Ck ← candidate-gen(Fk-1);
  for each transaction t ∈ T do
    for each candidate c ∈ Ck do
      if c is contained in t then
        c.count++;
    end
  end
  Fk ← {c ∈ Ck | c.count/n ≥ minsup}
end
return F ← ∪k Fk;
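The loop above can be sketched in runnable form. The Python version below (an illustration, not the course's R tooling) simplifies candidate generation by joining any two (k-1)-item sets whose union has k items; combined with the prune step, this yields the same candidates as the ordered join:

```python
from itertools import combinations

def apriori(transactions, minsup):
    """Level-wise search: find F1, then build Fk from candidates over Fk-1."""
    n = len(transactions)
    # First pass: count 1-item sets and keep the frequent ones (F1).
    counts = {}
    for t in transactions:
        for item in t:
            s = frozenset([item])
            counts[s] = counts.get(s, 0) + 1
    Fk = {s for s, c in counts.items() if c / n >= minsup}
    frequent = set(Fk)
    k = 2
    while Fk:
        # Join: any two frequent (k-1)-item sets whose union has size k
        # (a simplification of the ordered prefix join in the pseudocode).
        Ck = {f1 | f2 for f1 in Fk for f2 in Fk if len(f1 | f2) == k}
        # Prune: drop candidates with an infrequent (k-1)-subset.
        Ck = {c for c in Ck
              if all(frozenset(s) in Fk for s in combinations(c, k - 1))}
        # Scan: keep candidates meeting minsup.
        Fk = {c for c in Ck
              if sum(c <= t for t in transactions) / n >= minsup}
        frequent |= Fk
        k += 1
    return frequent

# Dataset T from the worked example, minsup = 0.5.
T = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]
F = apriori(T, minsup=0.5)
print(frozenset({2, 3, 5}) in F)  # True: matches F3 in the worked example
```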
Candidate Generation
The candidate-gen function takes Fk-1 and returns a superset of the set of all frequent k-item sets. It includes two steps:
1. Join: Generate all possible candidate item sets Ck of length k
2. Prune: Remove the candidates in Ck that cannot be frequent
Candidate Generation (contd.)
The algorithm for candidate generation is:
Function candidate-gen(Fk-1)
  Ck ← ∅;
  forall f1, f2 ∈ Fk-1
    with f1 = {i1, …, ik-2, ik-1}
    and f2 = {i1, …, ik-2, i′k-1}
    and ik-1 < i′k-1 do
      c ← {i1, …, ik-1, i′k-1};   // join f1 and f2
      Ck ← Ck ∪ {c};
      for each (k-1)-subset s of c do
        if (s ∉ Fk-1) then
          delete c from Ck;   // prune
      end
  end
  return Ck;
Candidate Generation: Example
Assume F3 = {{1, 2, 3}, {1, 2, 4}, {1, 3, 4}, {1, 3, 5}, {2, 3, 4}}, then:
After join: C4 = {{1, 2, 3, 4}, {1, 3, 4, 5}}
After prune: C4 = {{1, 2, 3, 4}}
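A minimal Python sketch of candidate-gen that reproduces this result; it joins any pair of (k-1)-item sets whose union has k items (a simplification of the ordered prefix join), relying on the prune step to discard the extras:

```python
from itertools import combinations

def candidate_gen(F_prev, k):
    """Join-and-prune sketch producing k-item candidates from F_{k-1}."""
    F_prev = {frozenset(f) for f in F_prev}
    # Join: union any two (k-1)-item sets that differ in exactly one item.
    Ck = {f1 | f2 for f1 in F_prev for f2 in F_prev if len(f1 | f2) == k}
    # Prune: every (k-1)-subset of a surviving candidate must be frequent.
    return {c for c in Ck
            if all(frozenset(s) in F_prev for s in combinations(c, k - 1))}

F3 = [{1, 2, 3}, {1, 2, 4}, {1, 3, 4}, {1, 3, 5}, {2, 3, 4}]
print(candidate_gen(F3, 4))  # {frozenset({1, 2, 3, 4})}
```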
Step 2—Generate Rules from Frequent Item Sets
For each frequent item set X and each proper nonempty subset A of X, let B = X − A.
A ➞ B is an association rule if:
confidence(A ➞ B) ≥ minconf
support(A ➞ B) = support(A ∪ B) = support(X)
confidence(A ➞ B) = support(A ∪ B) / support(A)
Generate Rules from Frequent Item Sets—Example
Assume {2,3,4} is frequent with sup = 50% and proper nonempty subsets: {2,3}, {2,4}, {3,4}, {2}, {3}, {4},
with sup = 50%, 50%, 75%, 75%, 75%, 75%, respectively.
Association rules:
2,3 ➞ 4, confidence = 100%
2,4 ➞ 3, confidence = 100%
3,4 ➞ 2, confidence = 67%
2 ➞ 3,4, confidence = 67%
3 ➞ 2,4, confidence = 67%
4 ➞ 2,3, confidence = 67%
Support of all rules = 50%
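The confidences follow mechanically from the supports; a Python sketch with the supports above hard-coded:

```python
from itertools import combinations

# Supports from the example above, as fractions of transactions.
support = {
    frozenset({2, 3, 4}): 0.50,
    frozenset({2, 3}): 0.50, frozenset({2, 4}): 0.50, frozenset({3, 4}): 0.75,
    frozenset({2}): 0.75, frozenset({3}): 0.75, frozenset({4}): 0.75,
}

def rules_from(X, minconf):
    """Emit A ➞ B for every proper nonempty subset A of X, with B = X - A."""
    X = frozenset(X)
    out = []
    for r in range(1, len(X)):
        for A in map(frozenset, combinations(X, r)):
            conf = support[X] / support[A]
            if conf >= minconf:
                out.append((sorted(A), sorted(X - A), round(conf, 2)))
    return out

# Only 2,3 ➞ 4 and 2,4 ➞ 3 survive at minconf = 80%.
for A, B, conf in rules_from({2, 3, 4}, minconf=0.8):
    print(A, "➞", B, "conf =", conf)
```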
Demo—Perform Association Using the Apriori Algorithm
This demo will show the steps to perform association using the Apriori algorithm.
Demo—Perform Visualization on Association Rules
This demo will show the steps to visualize association rules.
Problems with Association Mining
Some problems related to association mining are:
• Single minsup: It assumes that all items have similar frequencies and/or are of the same nature.
• Item frequency variation: In practice, some items appear very frequently, whereas others appear rarely.
• Choice of minsup: If minsup is set high, rules with rare items are not found; if minsup is set low, it may cause a combinatorial explosion.
Quiz
QUIZ
Association rules are interesting:
1
a. if they satisfy both minimum and maximum iterations.
b. if they satisfy both minimum support and minimum confidence
thresholds.
c. if they satisfy both association correlations.
d. if they satisfy Apriori constants.
QUIZ
Association rules are interesting:
1
a. if they satisfy both minimum and maximum iterations.
b. if they satisfy both minimum support and minimum confidence
thresholds.
c. if they satisfy both association correlations.
d. if they satisfy Apriori constants.
The correct answer is b.
Explanation: Association rules are interesting if they satisfy both minimum support and
minimum confidence thresholds.
QUIZ
What is the formula to calculate support?
2
a. Pr(X | Y)
b. Pr(X ∪ Y)
c. Pr(X * Y)
d. Pr(X / Y)
QUIZ
What is the formula to calculate support?
2
a. Pr(X | Y)
b. Pr(X ∪ Y)
c. Pr(X * Y)
d. Pr(X / Y)
The correct answer is b.
Explanation: The formula to calculate support is Pr(X ∪ Y).
QUIZ Which of the following algorithms can be used to solve the problem of support and
3 confidence?
a. Candidate generation
b. Classification
c. Apriori
d. Item set
QUIZ Which of the following algorithms can be used to solve the problem of support and
3 confidence?
a. Candidate generation
b. Classification
c. Apriori
d. Item set
The correct answer is c.
Explanation: The Apriori algorithm can be used to solve the problem of support and
confidence.
QUIZ
Which of the following conditions is true for mining frequent item sets?
4
a. sup < minsup
b. sup > minsup
c. sup = minsup
d. sup ≥ minsup
QUIZ
Which of the following conditions is true for mining frequent item sets?
4
a. sup < minsup
b. sup > minsup
c. sup = minsup
d. sup ≥ minsup
The correct answer is d.
Explanation: sup ≥ minsup is true for mining frequent item sets.
Summary
Let us summarize the topics covered in this lesson:
• Association rule mining finds out interesting patterns in a dataset.
• Interesting relationships have two parameters: frequent item sets and association rules.
• An association rule is a pattern that states when X occurs, Y occurs with a certain probability.
• The measures of the strength of association rules are support and confidence.
• While support and confidence can help quantify the success of association analysis, for thousands of sale items the process can be very slow; this is solved by algorithms such as Apriori.
• The Apriori algorithm includes two steps: mining all frequent item sets and generating rules from frequent item sets.
This concludes “Association.”
This is the last lesson of the course.