Data Science with R
Unit V (Part-1) : Association Rules
M. Narasimha Raju
Asst Professor, Dept. of Computer Science &
Engineering
Association Rules:
Overview
Apriori Algorithm
Evaluation of Candidate Rules
Applications of Association Rules
Example
Validation and Testing
Regression:
Linear Regression
Logistic Regression
Reasons to Choose and Cautions
Overview
Given a large collection of transactions, in which each
transaction consists of one or more items, association rules
examine the items being purchased to see which items are
frequently bought together and to discover a list of rules that
describe the purchasing behavior.
The goal with association rules is to discover interesting
relationships among the items.
The relationships that are interesting depend both on the
business context and the nature of the algorithm being used
for the discovery.
(Figure: The general logic behind association rules)
Association
Each of the uncovered rules is in the form X → Y, meaning
that when item X is observed, item Y is also observed. In
this case, the left-hand side (LHS) of the rule is X, and the
right-hand side (RHS) of the rule is Y.
Association rule algorithms can discover patterns in the data
and disclose rules about which products are purchased
together.
Market basket analysis.
Each transaction can be viewed as the shopping basket of a
customer that contains one or more items. This is also known as an
itemset.
The term itemset refers to a collection of items or individual
entities that contain some kind of relationship.
This could be a set of retail items purchased together in one
transaction, a set of hyperlinks clicked on by one user in a single
session, or a set of tasks done in one day.
An itemset containing k items is called a k-itemset and is
denoted {item 1, item 2, . . . , item k}.
Computation of the association rules is typically based on itemsets.
Apriori - Support
The Apriori algorithm pioneered the use of support for
pruning the itemsets and controlling the exponential growth
of candidate itemsets.
Given an itemset L, the support of L is the percentage of
transactions that contain L.
For example, if 80% of all transactions contain itemset
{bread}, then the support of {bread} is 0.8.
Similarly, if 60% of all transactions contain itemset
{bread, butter}, then the support of {bread, butter} is
0.6.
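Support can be computed directly in R. The following is a minimal sketch assuming the arules package is installed; the five baskets are hypothetical:

library(arules)

# Five hypothetical shopping baskets
baskets <- list(
  c("bread", "butter"),
  c("bread", "jam"),
  c("bread", "butter", "milk"),
  c("bread"),
  c("butter", "milk")
)
trans <- as(baskets, "transactions")

# Support of each 1-itemset: the fraction of baskets containing the item
itemFrequency(trans)   # bread appears in 4 of 5 baskets, so support({bread}) = 0.8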
Minimum support
A frequent itemset has items that appear together often
enough.
If the minimum support is set at 0.5, any itemset can be
considered a frequent itemset if at least 50% of the
transactions contain this itemset.
The support of a frequent itemset should be greater than or
equal to the minimum support.
If an itemset is considered frequent, then any subset of the
frequent itemset must also be frequent.
This is referred to as the Apriori property (or downward
closure property).
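Continuing the sketch above, the frequent itemsets at a minimum support of 0.5 can be listed with arules; every subset of a reported itemset is itself reported, which is the Apriori property in action:

# Frequent itemsets in the toy transactions at minimum support 0.5
fis <- apriori(trans, parameter = list(support = 0.5,
                                       target = "frequent itemsets"))
inspect(fis)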
Frequent itemsets
If 60% of the transactions
contain {bread,jam}, then at
least 60% of all the transactions
will contain {bread} or {jam}.
In other words, when the
support of {bread,jam} is 0.6,
the support of {bread} or
{jam} is at least 0.6.
If itemset {B,C,D} is frequent,
then all the subsets of this
itemset (shaded in the
accompanying figure) must also
be frequent itemsets.
Apriori Algorithm
The Apriori algorithm takes a bottom-up iterative approach to uncovering
the frequent itemsets by first determining all the possible items (or 1-
itemsets, for example {bread}, {eggs}, {milk}, …) and then identifying
which among them are frequent.
Assuming the minimum support threshold (or the minimum support
criterion) is set at 0.5, the algorithm identifies and retains those itemsets
that appear in at least 50% of all transactions and discards (or “prunes
away”) the itemsets that have a support less than 0.5 or appear in fewer
than 50% of the transactions.
The word prune is used like it would be in gardening, where unwanted
branches of a bush are clipped away.
Apriori algorithm
In the next iteration of the Apriori algorithm, the identified frequent 1-itemsets are
paired into 2-itemsets (for example, {bread,eggs}, {bread,milk}, {eggs,milk}, …)
and again evaluated to identify the frequent 2-itemsets among them.
At each iteration, the algorithm checks whether the support criterion can be met;
if it can, the algorithm grows the itemset, repeating the process until it runs out of
support or until the itemsets reach a predefined length.
Let variable Ck be the set of candidate k-itemsets and variable Lk be the set of k-
itemsets that satisfy the minimum support. Given a transaction database D, a
minimum support threshold δ, and an optional parameter N indicating the
maximum length an itemset could reach, Apriori iteratively computes frequent
itemsets Lk+1 based on Lk.
Apriori algorithm
(Figure: pseudocode of the Apriori algorithm)
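The following is a minimal, unoptimized R sketch of the iteration just described, assuming D is a list of character-vector transactions, delta is the minimum support threshold, and N is the optional maximum itemset length. Candidate generation is simplified here (each frequent k-itemset is grown by one frequent item, rather than joining Lk with itself):

apriori_sketch <- function(D, delta, N = Inf) {
  # Support of itemset s: fraction of transactions containing all items in s
  support <- function(s) mean(vapply(D, function(t) all(s %in% t), logical(1)))
  # L1: frequent 1-itemsets
  Lk <- Filter(function(s) support(s) >= delta,
               as.list(sort(unique(unlist(D)))))
  result <- Lk
  k <- 1
  while (length(Lk) > 0 && k < N) {
    singles <- unique(unlist(Lk))
    # C(k+1): grow each frequent k-itemset by one item appearing in Lk
    Ck <- unique(unlist(lapply(Lk, function(s) {
      lapply(setdiff(singles, s), function(i) sort(c(s, i)))
    }), recursive = FALSE))
    # Prune: keep only candidates meeting the minimum support delta
    Lk <- Filter(function(s) support(s) >= delta, Ck)
    result <- c(result, Lk)
    k <- k + 1
  }
  result  # all frequent itemsets found
}

For the toy baskets above, apriori_sketch(baskets, delta = 0.5) should return the same frequent itemsets as the arules call.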
Apriori algorithm
The first step of the Apriori algorithm is to identify the frequent
itemsets by starting with each item in the transactions that meets the
predefined minimum support threshold δ.
These itemsets are 1-itemsets denoted as L1, as each 1-itemset
contains only one item.
Next, the algorithm grows the itemsets by joining L1 onto itself to
form new, grown 2-itemsets denoted as L2 and determines the
support of each 2-itemset in L2.
Those itemsets that do not meet the minimum support threshold δ are
pruned away.
The growing and pruning process is repeated until no itemsets meet
the minimum support threshold.
Once completed, the output of the Apriori algorithm is the
collection of all the frequent k-itemsets.
Evaluation of Candidate Rules
Confidence is defined as the measure of certainty or
trustworthiness associated with each discovered rule.
Confidence is the percentage of transactions that contain both X
and Y out of all the transactions that contain X:
Confidence(X → Y) = Support(X ∪ Y) / Support(X).
For example, if {bread, eggs, milk} has a support of 0.15 and
{bread, eggs} also has a support of 0.15, the confidence of
rule {bread, eggs}→{milk} is 1, which means 100% of the
time a customer buys bread and eggs, milk is bought as well.
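As a sketch, rules and their confidence can be mined from the toy transactions above; the thresholds here are illustrative:

# Mine rules meeting minimum support 0.2 and minimum confidence 0.8
rules_toy <- apriori(trans, parameter = list(support = 0.2,
                                             confidence = 0.8,
                                             target = "rules"))
inspect(sort(rules_toy, by = "confidence"))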
Evaluation of Candidate Rules
A relationship may be thought of as interesting when the algorithm
identifies the relationship with a measure of confidence greater
than or equal to a predefined threshold.
This predefined threshold is called the minimum confidence.
Lift measures how many times more often X and Y occur together
than expected if they were statistically independent of each other:
Lift(X → Y) = Support(X ∪ Y) / (Support(X) * Support(Y)).
Lift is a measure of how X and Y are really related rather than
coincidentally happening together.
Lift is 1 if X and Y are statistically independent of each other. In
contrast, a lift of X → Y greater than 1 indicates that there is some
usefulness to the rule. A larger value of lift suggests a greater
strength of the association between X and Y.
Evaluation of Candidate Rules
Assuming 1,000 transactions, with {milk, eggs} appearing in
300 of them, {milk} appearing in 500, and {eggs} appearing
in 400, then Lift(milk→eggs) = 0.3 / (0.5 * 0.4) = 1.5.
If {bread} appears in 400 transactions and {milk, bread}
appears in 400, then Lift(milk→bread) = 0.4 / (0.5 * 0.4) = 2.
Therefore it can be concluded that milk and bread have a
stronger association than milk and eggs.
Evaluation of Candidate Rules
Leverage measures the difference in the probability of X and Y appearing
together in the dataset compared to what would be expected if X and Y were
statistically independent of each other:
Leverage(X → Y) = Support(X ∪ Y) − Support(X) * Support(Y).
Leverage is 0 when X and Y are statistically independent of each other.
If X and Y have some kind of relationship, the leverage would be greater than
zero.
A larger leverage value indicates a stronger relationship between X and Y.
For the previous example, Leverage(milk→eggs) = 0.3 − (0.5 * 0.4) = 0.1 and
Leverage(milk→bread) = 0.4 − (0.5 * 0.4) = 0.2.
It again confirms that milk and bread have a stronger association than milk
and eggs.
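With arules, lift and leverage can be computed for mined rules via interestMeasure(); a sketch using the toy rules from earlier:

# Lift and leverage for the toy rules (measure names per arules)
interestMeasure(rules_toy, measure = c("lift", "leverage"),
                transactions = trans)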
Applications of Association Rules
Broad-scale approaches to better merchandising—what
products should be included in or excluded from the inventory
each month
Cross-merchandising between products and high-margin or
high-ticket items
Physical or logical placement of product within related
categories of products
Promotional programs—multiple product purchase
incentives managed through a loyalty card program
Recommendation Systems
Many online service providers such as Amazon and Netflix use
recommender systems.
Recommender systems can use association rules to discover
related products or identify customers who have similar
interests.
For example, association rules may suggest that customers
who have bought product A have also bought product B, or
that customers who have bought products A, B, and C have
interests similar to those of a given customer.
These findings provide opportunities for retailers to cross-sell
their products.
Clickstream analysis
Clickstream analysis refers to the analytics on data related to
web browsing and user clicks, which is stored on the client or
the server side.
Web usage log files generated on web servers contain huge
amounts of information, and association rules can potentially
give useful knowledge to web usage data analysts.
For example, association rules may suggest that website
visitors who land on page X click on links A, B, and C much
more often than links D, E, and F.
This observation provides valuable insight on how to better
personalize and recommend the content to site visitors.
An Example: Transactions in a Grocery Store
Using R and the arules and arulesViz packages
The Groceries Dataset
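A minimal sketch of loading the data; the Groceries dataset ships with the arules package:

library(arules)
data(Groceries)     # load the bundled grocery transactions
summary(Groceries)  # 9,835 transactions across 169 items
class(Groceries)    # "transactions"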
The class of the dataset is
transactions, as defined by the arules
package. The transactions class
contains three slots:
transactionInfo: A data frame with
vectors of the same length as the
number of transactions
itemInfo: A data frame to store item
labels
data: A binary incidence matrix that
indicates which item labels appear in
each transaction
Frequent Itemset Generation
The apriori() function from the arules package implements the Apriori
algorithm to create frequent itemsets.
Note that, by default, the apriori() function executes all the iterations at once.
Assume that the minimum support threshold is set to 0.02 based on management
discretion.
Because the dataset contains 9,835 transactions, an itemset should appear at
least 197 times to be considered a frequent itemset.
The first iteration of the Apriori algorithm computes the support of each product in
the dataset and retains those products that satisfy the minimum support.
The following code identifies 59 frequent 1-itemsets that satisfy the minimum
support.
The parameters of apriori() specify the minimum and maximum lengths of the
itemsets, the minimum support threshold, and the target indicating the type of
association mined.
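The code listing referred to above did not survive extraction; the following reconstruction is consistent with the text (parameter names per the arules documentation):

# Frequent 1-itemsets at minimum support 0.02
itemsets <- apriori(Groceries,
                    parameter = list(minlen = 1, maxlen = 1,
                                     support = 0.02,
                                     target = "frequent itemsets"))
summary(itemsets)  # 59 frequent 1-itemsets at this threshold
inspect(head(sort(itemsets, by = "support"), 10))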
Rule Generation and Visualization
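The rule-mining call is not shown on the slide; the following reconstruction assumes the thresholds implied by the 2,918 rules discussed below, and loads arulesViz, which supplies plot() for rules:

library(arulesViz)
# Mine rules at minimum support 0.001 and minimum confidence 0.6
rules <- apriori(Groceries,
                 parameter = list(support = 0.001,
                                  confidence = 0.6,
                                  target = "rules"))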
plot(rules)
The scatterplot shows that, of the 2,918 rules generated from
the Groceries dataset, the highest lift occurs at a low support
and a low confidence.
Entering plot(rules@quality) displays a scatterplot matrix
(Figure 5-4) to compare the support, confidence, and lift of
the 2,918 rules. In the matrix, lift is proportional to
confidence and illustrates several linear groupings.
Lift = Confidence / Support(Y).
When the support of Y remains the same, lift is proportional to
confidence, and the slope of the linear trend is the reciprocal
of Support(Y).
Validation and Testing
After gathering the output rules, it may become
necessary to use one or more methods to validate the
results in the business context for the sample dataset.
The first approach can be established through statistical
measures such as confidence, lift, and leverage.
Rules that involve mutually independent items or cover
few transactions are considered uninteresting because
they may capture spurious relationships.
Confidence measures the chance that X and Y appear together in
relation to the chance X appears.
Confidence can be used to identify the interestingness of the rules.
Lift and leverage both compare the support of X and Y against their
individual support.
While mining data with association rules, some rules generated could be
purely coincidental.
For example, if 95% of customers buy X and 90% of customers buy Y,
then X and Y would occur together at least 85% of the time, even if there
is no relationship between the two (by inclusion-exclusion,
Support(X ∪ Y) ≥ 0.95 + 0.90 − 1 = 0.85).
Measures like lift and leverage ensure that interesting rules are
identified rather than coincidental ones.
Diagnostics
Although the Apriori algorithm is easy to understand and
implement, some of the rules generated are uninteresting or
practically useless.
Additionally, some of the rules may be generated due to coincidental
relationships between the variables.
Measures like confidence, lift, and leverage should be used along
with human insights to address this problem.
The Apriori algorithm reduces the computational workload by only
examining itemsets that meet the specified minimum threshold.
However, depending on the size of the dataset, the Apriori algorithm
can be computationally expensive.
For each level of support, the algorithm requires a scan of the entire
transaction database.
Approaches to improve Apriori’s efficiency:
Partitioning: Any itemset that is potentially frequent in a transaction
database must be frequent in at least one of the partitions of the
transaction database.
Sampling: This extracts a subset of the data with a lower support
threshold and uses the subset to perform association rule mining.
Transaction reduction: A transaction that does not contain frequent k-
itemsets is useless in subsequent scans and therefore can be ignored.
Hash-based itemset counting: If the corresponding hashing bucket
count of a k-itemset is below a certain threshold, the k-itemset cannot
be frequent.
Dynamic itemset counting: Only add new candidate itemsets when all
of their subsets are estimated to be frequent.
Data Science with R
Unit V (Part-2) : Association Rules With R Programming
Thank You
M. Narasimha Raju
Asst Professor, Dept. of Computer Science &
Engineering