MINING FREQUENT PATTERNS, ASSOCIATIONS, AND CLASSIFICATION
# Basic Concepts of Frequent Patterns and Association Rules
Frequent patterns are patterns (e.g., itemsets, subsequences, or substructures) that appear
frequently in a data set. For example, a set of items, such as milk and bread, that appear
frequently together in a transaction data set is a frequent itemset.
A subsequence, such as buying first a PC, then a digital camera, and then a memory card, if
it occurs frequently in a shopping history database, is a (frequent) sequential pattern.
A substructure can refer to different structural forms, such as subgraphs, subtrees, or
sublattices, which may be combined with itemsets or subsequences. If a substructure
occurs frequently, it is called a (frequent) structured pattern.
Market Basket Analysis: A Motivating Example
Frequent itemset mining leads to the discovery of associations and correlations among
items in large transactional or relational data sets. A typical example of frequent itemset
mining is market basket analysis. This process analyzes customer buying habits by finding
associations between the different items that customers place in their “shopping baskets”.
For example, the information that customers who purchase computers also tend to buy antivirus software at the same time is represented in the following association rule:

computer ⇒ antivirus software [support = 2%, confidence = 60%]

Rule support and confidence are two measures of rule interestingness. A support of 2% for this rule means that 2% of all the transactions under analysis show that computer and antivirus software are purchased together. A confidence of 60% means that 60% of the customers who purchased a computer also bought the software. Typically, association rules are considered interesting if they satisfy both a minimum support threshold and a minimum confidence threshold.
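To make these two measures concrete, here is a minimal sketch (the transactions and item names are invented for illustration, not taken from the text) that computes the support and confidence of a candidate rule:

```python
# A minimal sketch of computing rule support and confidence.
# The transactions below are invented purely for illustration.
transactions = [
    {"computer", "antivirus", "mouse"},
    {"computer", "antivirus"},
    {"computer", "printer"},
    {"printer", "mouse"},
    {"computer", "antivirus", "printer"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """support(A ∪ B) / support(A) for the rule A ⇒ B."""
    return support(antecedent | consequent, transactions) / support(antecedent, transactions)

A, B = {"computer"}, {"antivirus"}
print(f"support    = {support(A | B, transactions):.2f}")   # 3/5 = 0.60
print(f"confidence = {confidence(A, B, transactions):.2f}")  # 3/4 = 0.75
```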
Frequent Itemsets, Closed Itemsets, and Association Rules
Let I = {I1, I2, ..., Im} be the set of all items. Let D, the task-relevant data, be a set of database transactions where each transaction T is a nonempty itemset such that T ⊆ I. Each transaction is associated with an identifier, called a TID. A transaction T is said to contain a set of items A if A ⊆ T. An association rule is an implication of the form A ⇒ B, where A ⊂ I, B ⊂ I, A ≠ ∅, B ≠ ∅, and A ∩ B = ∅.
Rules that satisfy both a minimum support threshold (min sup) and a minimum confidence threshold (min conf) are called strong.
An itemset is closed if it has no proper superset with the same support count; a frequent itemset that is also closed is called a closed frequent itemset.
In general, association rule mining can be viewed as a two-step process:
1. Find all frequent itemsets: By definition, each of these itemsets will occur at least
as frequently as a predetermined minimum support count, min sup.
2. Generate strong association rules from the frequent itemsets: By definition, these
rules must satisfy minimum support and minimum confidence.
# Frequent Itemset Mining Methods
1.Apriori Algorithm: Finding Frequent Itemsets
The Apriori algorithm was proposed by R. Agrawal and R. Srikant in 1994 for mining frequent itemsets for Boolean association rules. The name of the algorithm comes from the fact that it uses prior knowledge of frequent itemset properties.
Apriori employs an iterative approach known as a level-wise search, where k-itemsets are
used to explore (k + 1)-itemsets.
First, the set of frequent 1-itemsets is found by scanning the database to accumulate the
count for each item, and collecting those items that satisfy minimum support. The resulting
set is denoted by L1. Next, L1 is used to find L2, the set of frequent 2-itemsets, which is
used to find L3, and so on, until no more frequent k-itemsets can be found. The finding of
each Lk requires one full scan of the database.
To improve the efficiency of the level-wise generation of frequent itemsets, an important
property called the Apriori property is used to reduce the search space.
Apriori property: All nonempty subsets of a frequent itemset must also be frequent.
The Apriori property is based on the following observation. By definition, if an itemset I
does not satisfy the minimum support threshold, then I is not frequent, that is, P(I) < min
sup. If an item A is added to the itemset I, then the resulting itemset (i.e., I ∪A) cannot
occur more frequently than I. Therefore, I ∪A is not frequent either, that is, P(I ∪A) < min
sup. This property belongs to a special category of properties called antimonotonicity in the
sense that if a set cannot pass a test, all of its supersets will fail the same test as
well.
Example: Generation of the candidate itemsets and frequent itemsets, where the
minimum support count is 2.
1. In the first iteration of the algorithm, each item is a member of the set of candidate
1-itemsets, C1. The algorithm simply scans all of the transactions to count the
number of occurrences of each item.
2. Suppose that the minimum support count required is 2, that is, min sup = 2. The set
of frequent 1-itemsets, L1, can then be determined. In our example, all of the
candidates in C1 satisfy minimum support.
3. To discover the set of frequent 2-itemsets, L2, the algorithm uses the join L1 ⋈ L1 to
generate a candidate set of 2-itemsets, C2.
4. Next, the transactions in D are scanned and the support count of each candidate
itemset in C2 is accumulated.
5. The set of frequent 2-itemsets, L2, is then determined, consisting of those candidate
2-itemsets in C2 having minimum support.
6. The generation of the set of the candidate 3-itemsets, C3.
7. The transactions in D are scanned to determine L3, consisting of those candidate 3-
itemsets in C3 having minimum support.
8. The algorithm uses L3 ⋈ L3 to generate a candidate set of 4-itemsets, C4. Although
the join results in {{I1, I2, I3, I5}}, itemset {I1, I2, I3, I5} is pruned because its
subset {I2, I3, I5} is not frequent. Thus, C4 = ∅, and the algorithm terminates,
having found all of the frequent itemsets.
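The level-wise process above can be summarized in code. Below is a minimal Apriori sketch (an illustration, not the textbook's pseudocode); the transaction table reconstructs the running AllElectronics example implied by the support counts quoted later (I2: 7, I1: 6, I3: 6, I4: 2, I5: 2):

```python
from itertools import combinations

# The nine-transaction running example (items I1–I5).
D = [
    {"I1", "I2", "I5"}, {"I2", "I4"}, {"I2", "I3"},
    {"I1", "I2", "I4"}, {"I1", "I3"}, {"I2", "I3"},
    {"I1", "I3"}, {"I1", "I2", "I3", "I5"}, {"I1", "I2", "I3"},
]
MIN_SUP = 2  # minimum support count

def support_count(itemset, transactions):
    """Number of transactions that contain every item of `itemset`."""
    return sum(1 for t in transactions if itemset <= t)

def apriori(transactions, min_sup):
    # C1/L1: every item is a candidate; keep those meeting min_sup.
    items = {i for t in transactions for i in t}
    Lk = {frozenset([i]) for i in items
          if support_count(frozenset([i]), transactions) >= min_sup}
    frequent, k = {}, 1
    while Lk:
        frequent.update({s: support_count(s, transactions) for s in Lk})
        # Join step: Lk ⋈ Lk yields candidate (k+1)-itemsets.
        candidates = {a | b for a in Lk for b in Lk if len(a | b) == k + 1}
        # Prune step (Apriori property): drop candidates with an infrequent k-subset.
        candidates = {c for c in candidates
                      if all(frozenset(s) in Lk for s in combinations(c, k))}
        # One full scan of the database determines L(k+1).
        Lk = {c for c in candidates
              if support_count(c, transactions) >= min_sup}
        k += 1
    return frequent

freq = apriori(D, MIN_SUP)
for itemset, count in sorted(freq.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), count)
# The largest frequent itemsets found are {I1, I2, I3}: 2 and {I1, I2, I5}: 2;
# the join L3 ⋈ L3 produces no candidate that survives pruning, so C4 = ∅.
```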
2.Generating Association Rules from Frequent Itemsets
Once the frequent itemsets from transactions in a database D have been found, it is
straightforward to generate strong association rules from them.
Strong rules are generated on the basis of confidence, which can be expressed as a conditional probability in terms of itemset support counts:
confidence(A ⇒ B) = P(B | A) = support count(A ∪ B) / support count(A),
where support count(A ∪ B) is the number of transactions containing the itemset A ∪ B, and support count(A) is the number of transactions containing the itemset A. Association rules can then be generated as follows:
For each frequent itemset l, generate all nonempty proper subsets of l.
For every nonempty proper subset s of l, output the rule “s ⇒ (l − s)” if (support count(l)/support count(s)) ≥ min_conf.
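As an illustrative sketch (assuming the `apriori` function and the `freq` dictionary of support counts from the earlier sketch), the rule-generation step can be written as:

```python
from itertools import combinations

def generate_rules(frequent, min_conf):
    """Emit strong rules s ⇒ (l − s) from a dict {frozenset: support count}."""
    rules = []
    for l, sup_l in frequent.items():
        if len(l) < 2:
            continue
        for r in range(1, len(l)):                 # all nonempty proper subsets of l
            for s in map(frozenset, combinations(l, r)):
                # Every subset of a frequent itemset is itself frequent, so frequent[s] exists.
                conf = sup_l / frequent[s]         # support count(l) / support count(s)
                if conf >= min_conf:
                    rules.append((set(s), set(l - s), conf))
    return rules

for lhs, rhs, conf in generate_rules(freq, min_conf=0.70):
    print(f"{sorted(lhs)} => {sorted(rhs)} (confidence = {conf:.0%})")
```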
Example: Generating association rules. Let’s try an example based on the transactional data for AllElectronics. The data contain the frequent itemset X = {I1, I2, I5}. The nonempty proper subsets of X are {I1, I2}, {I1, I5}, {I2, I5}, {I1}, {I2}, and {I5}, and each subset s yields a candidate rule s ⇒ (X − s). The resulting association rules and their confidences are:
{I1, I2} ⇒ I5, confidence = 2/4 = 50%
{I1, I5} ⇒ I2, confidence = 2/2 = 100%
{I2, I5} ⇒ I1, confidence = 2/2 = 100%
I1 ⇒ {I2, I5}, confidence = 2/6 = 33%
I2 ⇒ {I1, I5}, confidence = 2/7 = 29%
I5 ⇒ {I1, I2}, confidence = 2/2 = 100%
If the minimum confidence threshold is, say, 70%, then only the second, third, and last rules are output, because these are the only ones generated that are strong.
3.Improving the Efficiency of Apriori
Many variations of the Apriori algorithm have been proposed that focus on improving the
efficiency of the original algorithm. Several of these variations are summarized as follows:
Hash-based technique (hashing itemsets into corresponding buckets): A hash-
based technique can be used to reduce the size of the candidate k-itemsets, Ck , for
k > 1.
Transaction reduction (reducing the number of transactions scanned in future
iterations): A transaction that does not contain any frequent k-itemsets cannot
contain any frequent (k + 1)-itemsets. Therefore, such a transaction can be marked
or removed from further consideration, as shown in the sketch after this list.
Partitioning (partitioning the data to find candidate itemsets): A partitioning
technique can be used that requires just two database scans to mine the frequent
itemsets. It consists of two phases. In phase I, the algorithm divides the transactions
of D into n nonoverlapping partitions and finds the local frequent itemsets within each
partition; any itemset that is potentially frequent in D must be frequent in at least one
of the partitions. In phase II, a second scan of D is conducted in which the actual support
of each of these candidates is assessed to determine the global frequent itemsets.
Sampling (mining on a subset of the given data): The basic idea of the sampling
approach is to pick a random sample S of the given data D, and then search for
frequent itemsets in S instead of D.
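As a small sketch of the transaction reduction idea (illustrative only, reusing the level-wise loop from the Apriori sketch above):

```python
def prune_transactions(transactions, Lk):
    """Transaction reduction: a transaction that contains no frequent k-itemset
    cannot contain any frequent (k + 1)-itemset, so drop it from future scans."""
    return [t for t in transactions if any(itemset <= t for itemset in Lk)]

# Inside the level-wise loop, after L(k) has been determined:
#     transactions = prune_transactions(transactions, Lk)
```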
4.FP-Growth Approach for Mining Frequent Itemsets
Frequent pattern growth, or simply FP-growth, adopts a divide-and-conquer strategy for mining frequent itemsets. First, it compresses the database representing frequent items into a frequent pattern tree, or FP-tree, which retains the itemset association information. It then divides the compressed database into a set of conditional databases (a special kind of projected database), each associated with one frequent item or “pattern fragment,” and mines each such database separately. For each “pattern fragment,” only its associated data sets need to be examined.
Example: FP-growth (finding frequent itemsets without candidate generation). We re-examine the mining of the transaction database D using the frequent pattern growth approach. The first scan of the database is the same as in Apriori: it derives the set of frequent items (1-itemsets) and their support counts (frequencies).
Let the minimum support count be 2. The set of frequent items is sorted in the order of
descending support count. This resulting set or list is denoted by L. Thus, we have L = {{I2:
7}, {I1: 6}, {I3: 6}, {I4: 2}, {I5: 2}}.
An FP-tree is then constructed as follows. First, create the root of the tree, labeled with
“null.” Scan database D a second time. The items in each transaction are processed in L
order (i.e., sorted according to descending support count), and a branch is created for each
transaction. For example, the scan of the first transaction, “T100: I1, I2, I5,” which contains
three items (I2, I1, I5 in L order), leads to the construction of the first branch of the tree
with three nodes, <I2: 1>, <I1: 1>, and <I5: 1>, where I2 is linked as a child to the root, I1
is linked to I2, and I5 is linked to I1.
The second transaction, T200, contains the items I2 and I4 in L order, which would result in
a branch where I2 is linked to the root and I4 is linked to I2. However, this branch would
share a common prefix, I2, with the existing path for T100. Therefore, we instead increment
the count of the I2 node by 1, and create a new node, <I4: 1>, which is linked as a child to
<I2: 2>.
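A minimal FP-tree construction sketch is given below (illustrative only; the node fields and helper names are assumptions, and the node-link header table and the mining of conditional pattern bases are omitted):

```python
from collections import Counter

class FPNode:
    def __init__(self, item, parent=None):
        self.item, self.count = item, 1
        self.parent = parent        # parent link (needed later for conditional pattern bases)
        self.children = {}          # item -> FPNode

def build_fp_tree(transactions, min_sup):
    # First scan: support counts; L = frequent items in descending support-count order.
    counts = Counter(i for t in transactions for i in t)
    L = sorted((i for i in counts if counts[i] >= min_sup),
               key=lambda i: (-counts[i], i))
    rank = {item: pos for pos, item in enumerate(L)}

    root = FPNode(None)             # the tree root, labeled "null"
    # Second scan: insert each transaction's frequent items in L order, sharing prefixes.
    for t in transactions:
        node = root
        for item in sorted((i for i in t if i in rank), key=rank.get):
            if item in node.children:
                node.children[item].count += 1      # shared prefix: increment the count
            else:
                node.children[item] = FPNode(item, parent=node)
            node = node.children[item]
    return root, L

D = [{"I1","I2","I5"}, {"I2","I4"}, {"I2","I3"}, {"I1","I2","I4"}, {"I1","I3"},
     {"I2","I3"}, {"I1","I3"}, {"I1","I2","I3","I5"}, {"I1","I2","I3"}]
root, L = build_fp_tree(D, min_sup=2)
print(L)                                                  # ['I2', 'I1', 'I3', 'I4', 'I5']
print({i: n.count for i, n in root.children.items()})     # {'I2': 7, 'I1': 2}
```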
5.Mining Frequent Itemsets Using the Vertical Data Format
Both the Apriori and FP-growth methods mine frequent patterns from a set of transactions
in TID-itemset format (i.e., {TID : itemset}), where TID is a transaction ID and itemset is the
set of items bought in transaction TID. This is known as the horizontal data format.
Alternatively, data can be presented in item-TID set format (i.e., {item : TID set}), where
item is an item name, and TID set is the set of transaction identifiers containing the item.
This is known as the vertical data format. Frequent itemsets can also be mined efficiently
using vertical data format, which is the essence of the Eclat (Equivalence Class
Transformation) algorithm.
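A minimal sketch of the vertical-format idea (illustrative, not the full Eclat algorithm): rewrite the data as item : TID-set pairs, then obtain the support count of a k-itemset by intersecting the TID sets of its items.

```python
from collections import defaultdict

# Horizontal data format: TID -> itemset (the same running example).
horizontal = {
    "T100": {"I1", "I2", "I5"}, "T200": {"I2", "I4"}, "T300": {"I2", "I3"},
    "T400": {"I1", "I2", "I4"}, "T500": {"I1", "I3"}, "T600": {"I2", "I3"},
    "T700": {"I1", "I3"}, "T800": {"I1", "I2", "I3", "I5"}, "T900": {"I1", "I2", "I3"},
}

# Convert to the vertical data format: item -> set of TIDs containing it.
vertical = defaultdict(set)
for tid, items in horizontal.items():
    for item in items:
        vertical[item].add(tid)

# The support count of a 2-itemset is the size of the intersection of the two TID sets.
tids = vertical["I1"] & vertical["I2"]
print(sorted(tids))   # ['T100', 'T400', 'T800', 'T900']  -> support count of {I1, I2} is 4
```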
# Pattern Evaluation Methods: From Association Mining to Correlation Analysis
A correlation measure can be used to augment the support–confidence framework for
association rules. This leads to correlation rules of the form:
A ⇒ B [support, confidence, correlation]
That is, a correlation rule is measured not only by its support and confidence but also by
the correlation between itemsets A and B. There are many different correlation measures
from which to choose.
1.Lift:
Lift is a simple correlation measure. The occurrence of itemset A is independent of the occurrence of itemset B if P(A ∪ B) = P(A)P(B); otherwise, itemsets A and B are dependent and correlated as events. This definition can easily be extended to more than two itemsets. The lift between the occurrence of A and B can be measured by computing
lift(A, B) = P(A ∪ B) / (P(A)P(B)).
If the resulting value is less than 1, then the occurrence of A is negatively correlated with the occurrence of B, meaning that the occurrence of one likely leads to the absence of the other. If the resulting value is greater than 1, then A and B are positively correlated, meaning that the occurrence of one implies the occurrence of the other. If the resulting value is equal to 1, then A and B are independent and there is no correlation between them.
Example
Suppose we are interested in analyzing transactions at AllElectronics with respect to the
purchase of computer games and videos. Let game refer to the transactions containing
computer games, and video refer to those containing videos. Of the 10,000 transactions
analyzed, the data show that 6000 of the customer transactions included computer games,
while 7500 included videos, and 4000 included both computer games and videos.
We need to study how the two itemsets, A and B, are correlated. Let (game)’ refer to the transactions that do not contain computer games, and (video)’ refer to those that do not contain videos. The transactions can be summarized in the following contingency table (expected values are shown in parentheses):

|          | game        | (game)'     | Total |
|----------|-------------|-------------|-------|
| video    | 4000 (4500) | 3500 (3000) | 7500  |
| (video)' | 2000 (1500) | 500 (1000)  | 2500  |
| Total    | 6000        | 4000        | 10000 |

From the table, we can see that the probability of purchasing a computer game is P({game}) = 0.60, the probability of purchasing a video is P({video}) = 0.75, and the probability of purchasing both is P({game, video}) = 0.40. The lift of the rule game ⇒ video is P({game, video})/(P({game}) × P({video})) = 0.40/(0.60 × 0.75) = 0.89. Because this value is less than 1, there is a negative correlation between the occurrence of {game} and {video}.
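As a quick check, a tiny sketch computing the lift from the probabilities above:

```python
# Probabilities taken from the 10,000-transaction example above.
p_game, p_video, p_both = 0.60, 0.75, 0.40

lift = p_both / (p_game * p_video)
print(round(lift, 2))   # 0.89 -> less than 1, so {game} and {video} are negatively correlated
```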
2.Chi-Square:
The second correlation measure that we study is the χ² measure.
A correlation relationship between two attributes, A and B, can be discovered by a χ² (chi-square) test. Suppose A has c distinct values, namely a1, a2, ..., ac, and B has r distinct values, namely b1, b2, ..., br. The data tuples described by A and B can be shown as a contingency table, with the c values of A making up the columns and the r values of B making up the rows. Let (Ai, Bj) denote the joint event that attribute A takes on value ai and attribute B takes on value bj, that is, (A = ai, B = bj). Each possible (Ai, Bj) joint event has its own cell (or slot) in the table. The χ² value (also known as the Pearson χ² statistic) is computed as
χ² = Σi Σj (oij − eij)² / eij,   where   eij = (count(A = ai) × count(B = bj)) / N,
oij is the observed frequency (actual count) of the joint event (Ai, Bj), eij is its expected frequency, N is the number of data tuples, count(A = ai) is the number of tuples having value ai for A, and count(B = bj) is the number of tuples having value bj for B. The sum is computed over all of the r × c cells. Note that the cells that contribute the most to the χ² value are those whose actual count is very different from that expected.
To compute the correlation using χ² analysis for nominal data, we need the observed value and expected value (displayed in parentheses) for each slot of the contingency table, as given in the table above. From the table, we can compute the χ² value as follows:
χ² = (4000 − 4500)²/4500 + (3500 − 3000)²/3000 + (2000 − 1500)²/1500 + (500 − 1000)²/1000
   = 55.56 + 83.33 + 166.67 + 250.00 = 555.6.
Because the χ² value is greater than 1, and the observed value of the slot (game, video) = 4000 is less than the expected value of 4500, buying game and buying video are negatively correlated.
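The same computation as a small sketch (observed counts from the contingency table above; expected counts derived from the row and column totals):

```python
# Observed counts: rows = (video, video'), columns = (game, game').
observed = [[4000, 3500],
            [2000,  500]]

N = sum(sum(row) for row in observed)                  # 10,000 transactions
row_totals = [sum(row) for row in observed]            # [7500, 2500]
col_totals = [sum(col) for col in zip(*observed)]      # [6000, 4000]

chi2 = 0.0
for i, row in enumerate(observed):
    for j, o in enumerate(row):
        e = row_totals[i] * col_totals[j] / N          # expected count for cell (i, j)
        chi2 += (o - e) ** 2 / e

print(round(chi2, 1))   # 555.6
```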
3.A Comparison of Pattern Evaluation Measures (all confidence, max confidence,
Kulczynski, and cosine):
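For reference, these four interestingness measures are commonly defined in terms of itemset support as follows:
all_confidence(A, B) = sup(A ∪ B) / max{sup(A), sup(B)}
max_confidence(A, B) = max{P(A | B), P(B | A)}
Kulczynski(A, B) = (1/2) (P(A | B) + P(B | A))
cosine(A, B) = P(A ∪ B) / √(P(A) P(B)) = sup(A ∪ B) / √(sup(A) sup(B))
Each of these measures ranges over [0, 1] and, unlike lift and χ², is null-invariant: its value is unaffected by transactions that contain neither A nor B.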