MINING FREQUENT PATTERNS, ASSOCIATIONS, AND CLASSIFICATION
# Basic Concepts of Frequent Patterns and Association Rules
Frequent patterns are patterns (e.g., itemsets, subsequences, or substructures) that appear
frequently in a data set. For example, a set of items, such as milk and bread, that appear
frequently together in a transaction data set is a frequent itemset.
A subsequence, such as buying first a PC, then a digital camera, and then a memory card, if
it occurs frequently in a shopping history database, is a (frequent) sequential pattern.
A substructure can refer to different structural forms, such as subgraphs, subtrees, or
sublattices, which may be combined with itemsets or subsequences. If a substructure
occurs frequently, it is called a (frequent) structured pattern.
Market Basket Analysis: A Motivating Example
Frequent itemset mining leads to the discovery of associations and correlations among
items in large transactional or relational data sets. A typical example of frequent itemset
mining is market basket analysis. This process analyzes customer buying habits by finding
associations between the different items that customers place in their “shopping baskets”.
For example, the information that customers who purchase computers also tend to buy antivirus software at the same time is represented in the following association rule:

computer ⇒ antivirus software [support = 2%, confidence = 60%]

Rule support and confidence are two measures of rule interestingness. A support of 2% for this rule means that 2% of all the transactions under analysis show that computer and antivirus software are purchased together. A confidence of 60% means that 60% of the customers who purchased a computer also bought the software. Typically, association rules are considered interesting if they satisfy both a minimum support threshold and a minimum confidence threshold.
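To make these two measures concrete, here is a minimal sketch (the transactions and item names are invented for illustration, not taken from the text) that computes the support and confidence of a candidate rule:

```python
# A minimal sketch of computing rule support and confidence.
# The transactions below are invented purely for illustration.
transactions = [
    {"computer", "antivirus", "mouse"},
    {"computer", "antivirus"},
    {"computer", "printer"},
    {"printer", "mouse"},
    {"computer", "antivirus", "printer"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """support(A ∪ B) / support(A) for the rule A ⇒ B."""
    return support(antecedent | consequent, transactions) / support(antecedent, transactions)

A, B = {"computer"}, {"antivirus"}
print(f"support    = {support(A | B, transactions):.2f}")   # 3/5 = 0.60
print(f"confidence = {confidence(A, B, transactions):.2f}")  # 3/4 = 0.75
```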
Frequent Itemsets, Closed Itemsets, and Association Rules
Let I = {I1, I2, ..., Im} be the set of all items. Let D, the task-relevant data, be a set of database transactions where each transaction T is a nonempty itemset such that T ⊆ I. Each transaction is associated with an identifier, called a TID. A transaction T is said to contain a set of items A if A ⊆ T. An association rule is an implication of the form A ⇒ B, where A ⊂ I, B ⊂ I, A ≠ ∅, B ≠ ∅, and A ∩ B = ∅.
Rules that satisfy both a minimum support threshold (min sup) and a minimum confidence threshold (min conf) are called strong.
An itemset is closed if it has no proper superset with the same support count; a frequent itemset that is also closed is called a closed frequent itemset.
In general, association rule mining can be viewed as a two-step process:
1. Find all frequent itemsets: By definition, each of these itemsets will occur at least
as frequently as a predetermined minimum support count, min sup.
2. Generate strong association rules from the frequent itemsets: By definition, these
rules must satisfy minimum support and minimum confidence.
# Frequent Itemset Mining Methods
1.Apriori Algorithm: Finding Frequent Itemsets
The Apriori algorithm was proposed by R. Agrawal and R. Srikant in 1994 for mining frequent itemsets for Boolean association rules. The name of the algorithm comes from the fact that it uses prior knowledge of frequent itemset properties.
Apriori employs an iterative approach known as a level-wise search, where k-itemsets are
used to explore (k + 1)-itemsets.
First, the set of frequent 1-itemsets is found by scanning the database to accumulate the
count for each item, and collecting those items that satisfy minimum support. The resulting
set is denoted by L1. Next, L1 is used to find L2, the set of frequent 2-itemsets, which is
used to find L3, and so on, until no more frequent k-itemsets can be found. The finding of
each Lk requires one full scan of the database.
To improve the efficiency of the level-wise generation of frequent itemsets, an important
property called the Apriori property is used to reduce the search space.
Apriori property: All nonempty subsets of a frequent itemset must also be frequent.
The Apriori property is based on the following observation. By definition, if an itemset I
does not satisfy the minimum support threshold, then I is not frequent, that is, P(I) < min
sup. If an item A is added to the itemset I, then the resulting itemset (i.e., I ∪A) cannot
occur more frequently than I. Therefore, I ∪A is not frequent either, that is, P(I ∪A) < min
sup. This property belongs to a special category of properties called antimonotonicity in the
sense that if a set cannot pass a test, all of its supersets will fail the same test as
well.
Example: Generation of the candidate itemsets and frequent itemsets, where the
minimum support count is 2.
1. In the first iteration of the algorithm, each item is a member of the set of candidate
1-itemsets, C1. The algorithm simply scans all of the transactions to count the
number of occurrences of each item.
2. Suppose that the minimum support count required is 2, that is, min sup = 2. The set
of frequent 1-itemsets, L1, can then be determined. In our example, all of the
candidates in C1 satisfy minimum support.
3. To discover the set of frequent 2-itemsets, L2, the algorithm uses the join L1 ⋈ L1 to
generate a candidate set of 2-itemsets, C2.
4. Next, the transactions in D are scanned and the support count of each candidate
itemset in C2 is accumulated.
5. The set of frequent 2-itemsets, L2, is then determined, consisting of those candidate
2-itemsets in C2 having minimum support.
6. The generation of the set of the candidate 3-itemsets, C3.
7. The transactions in D are scanned to determine L3, consisting of those candidate 3-
itemsets in C3 having minimum support.
8. The algorithm uses L3 ⋈ L3 to generate a candidate set of 4-itemsets, C4. Although
the join results in {{I1, I2, I3, I5}}, itemset {I1, I2, I3, I5} is pruned because its
subset {I2, I3, I5} is not frequent. Thus, C4 = ∅, and the algorithm terminates,
having found all of the frequent itemsets.
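The level-wise process above can be summarized in code. Below is a minimal Apriori sketch (an illustration, not the textbook's pseudocode); the transaction table reconstructs the running AllElectronics example implied by the support counts quoted later (I2: 7, I1: 6, I3: 6, I4: 2, I5: 2):

```python
from itertools import combinations

# The nine-transaction running example (items I1–I5).
D = [
    {"I1", "I2", "I5"}, {"I2", "I4"}, {"I2", "I3"},
    {"I1", "I2", "I4"}, {"I1", "I3"}, {"I2", "I3"},
    {"I1", "I3"}, {"I1", "I2", "I3", "I5"}, {"I1", "I2", "I3"},
]
MIN_SUP = 2  # minimum support count

def support_count(itemset, transactions):
    """Number of transactions that contain every item of `itemset`."""
    return sum(1 for t in transactions if itemset <= t)

def apriori(transactions, min_sup):
    # C1/L1: every item is a candidate; keep those meeting min_sup.
    items = {i for t in transactions for i in t}
    Lk = {frozenset([i]) for i in items
          if support_count(frozenset([i]), transactions) >= min_sup}
    frequent, k = {}, 1
    while Lk:
        frequent.update({s: support_count(s, transactions) for s in Lk})
        # Join step: Lk ⋈ Lk yields candidate (k+1)-itemsets.
        candidates = {a | b for a in Lk for b in Lk if len(a | b) == k + 1}
        # Prune step (Apriori property): drop candidates with an infrequent k-subset.
        candidates = {c for c in candidates
                      if all(frozenset(s) in Lk for s in combinations(c, k))}
        # One full scan of the database determines L(k+1).
        Lk = {c for c in candidates
              if support_count(c, transactions) >= min_sup}
        k += 1
    return frequent

freq = apriori(D, MIN_SUP)
for itemset, count in sorted(freq.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), count)
# The largest frequent itemsets found are {I1, I2, I3}: 2 and {I1, I2, I5}: 2;
# the join L3 ⋈ L3 produces no candidate that survives pruning, so C4 = ∅.
```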
2.Generating Association Rules from Frequent Itemsets
Once the frequent itemsets from transactions in a database D have been found, it is
straightforward to generate strong association rules from them.
Strong rules are generated on the basis of confidence, which can be expressed as a conditional probability in terms of itemset support counts:
confidence(A ⇒ B) = P(B | A) = support count(A ∪ B) / support count(A),
where support count(A ∪ B) is the number of transactions containing the itemset A ∪ B, and support count(A) is the number of transactions containing the itemset A. Association rules can then be generated as follows:
For each frequent itemset l, generate all nonempty proper subsets of l.
For every nonempty proper subset s of l, output the rule “s ⇒ (l − s)” if (support count(l)/support count(s)) ≥ min_conf.
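As an illustrative sketch (assuming the `apriori` function and the `freq` dictionary of support counts from the earlier sketch), the rule-generation step can be written as:

```python
from itertools import combinations

def generate_rules(frequent, min_conf):
    """Emit strong rules s ⇒ (l − s) from a dict {frozenset: support count}."""
    rules = []
    for l, sup_l in frequent.items():
        if len(l) < 2:
            continue
        for r in range(1, len(l)):                 # all nonempty proper subsets of l
            for s in map(frozenset, combinations(l, r)):
                # Every subset of a frequent itemset is itself frequent, so frequent[s] exists.
                conf = sup_l / frequent[s]         # support count(l) / support count(s)
                if conf >= min_conf:
                    rules.append((set(s), set(l - s), conf))
    return rules

for lhs, rhs, conf in generate_rules(freq, min_conf=0.70):
    print(f"{sorted(lhs)} => {sorted(rhs)} (confidence = {conf:.0%})")
```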
Example: Generating association rules. Let’s try an example based on the transactional data for AllElectronics. The data contain the frequent itemset X = {I1, I2, I5}. The nonempty proper subsets of X are {I1, I2}, {I1, I5}, {I2, I5}, {I1}, {I2}, and {I5}, and each subset s yields a candidate rule s ⇒ (X − s). The resulting association rules and their confidences are:
{I1, I2} ⇒ I5, confidence = 2/4 = 50%
{I1, I5} ⇒ I2, confidence = 2/2 = 100%
{I2, I5} ⇒ I1, confidence = 2/2 = 100%
I1 ⇒ {I2, I5}, confidence = 2/6 = 33%
I2 ⇒ {I1, I5}, confidence = 2/7 = 29%
I5 ⇒ {I1, I2}, confidence = 2/2 = 100%
If the minimum confidence threshold is, say, 70%, then only the second, third, and last rules are output, because these are the only ones generated that are strong.
3.Improving the Efficiency of Apriori
Many variations of the Apriori algorithm have been proposed that focus on improving the
efficiency of the original algorithm. Several of these variations are summarized as follows:
Hash-based technique (hashing itemsets into corresponding buckets): A hash-
based technique can be used to reduce the size of the candidate k-itemsets, Ck , for
k > 1.
Transaction reduction (reducing the number of transactions scanned in future
iterations): A transaction that does not contain any frequent k-itemsets cannot
contain any frequent (k + 1)-itemsets. Therefore, such a transaction can be marked
or removed from further consideration, as shown in the sketch after this list.
Partitioning (partitioning the data to find candidate itemsets): A partitioning
technique can be used that requires just two database scans to mine the frequent
itemsets. It consists of two phases. In phase I, the algorithm divides the transactions
of D into n nonoverlapping partitions and finds the local frequent itemsets within each
partition; any itemset that is potentially frequent in D must be frequent in at least one
of the partitions. In phase II, a second scan of D is conducted in which the actual support
of each of these candidates is assessed to determine the global frequent itemsets.
Sampling (mining on a subset of the given data): The basic idea of the sampling
approach is to pick a random sample S of the given data D, and then search for
frequent itemsets in S instead of D.
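As a small sketch of the transaction reduction idea (illustrative only, reusing the level-wise loop from the Apriori sketch above):

```python
def prune_transactions(transactions, Lk):
    """Transaction reduction: a transaction that contains no frequent k-itemset
    cannot contain any frequent (k + 1)-itemset, so drop it from future scans."""
    return [t for t in transactions if any(itemset <= t for itemset in Lk)]

# Inside the level-wise loop, after L(k) has been determined:
#     transactions = prune_transactions(transactions, Lk)
```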
4.FP-Growth Approach for Mining Frequent Itemsets
Frequent pattern growth, or simply FP-growth, adopts a divide-and-conquer strategy for mining frequent itemsets. First, it compresses the database representing frequent items into a frequent pattern tree, or FP-tree, which retains the itemset association information. It then divides the compressed database into a set of conditional databases (a special kind of projected database), each associated with one frequent item or “pattern fragment,” and mines each such database separately. For each “pattern fragment,” only its associated data sets need to be examined.
Example: FP-growth (finding frequent itemsets without candidate generation). We re-examine the mining of the transaction database D using the frequent pattern growth approach. The first scan of the database is the same as in Apriori: it derives the set of frequent items (1-itemsets) and their support counts (frequencies).
Let the minimum support count be 2. The set of frequent items is sorted in the order of
descending support count. This resulting set or list is denoted by L. Thus, we have L = {{I2:
7}, {I1: 6}, {I3: 6}, {I4: 2}, {I5: 2}}.
An FP-tree is then constructed as follows. First, create the root of the tree, labeled with
“null.” Scan database D a second time. The items in each transaction are processed in L
order (i.e., sorted according to descending support count), and a branch is created for each
transaction. For example, the scan of the first transaction, “T100: I1, I2, I5,” which contains
three items (I2, I1, I5 in L order), leads to the construction of the first branch of the tree
with three nodes, <I2: 1>, <I1: 1>, and <I5: 1>, where I2 is linked as a child to the root, I1
is linked to I2, and I5 is linked to I1.
The second transaction, T200, contains the items I2 and I4 in L order, which would result in
a branch where I2 is linked to the root and I4 is linked to I2. However, this branch would
share a common prefix, I2, with the existing path for T100. Therefore, we instead increment
the count of the I2 node by 1, and create a new node, <I4: 1>, which is linked as a child to
<I2: 2>.
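A minimal FP-tree construction sketch is given below (illustrative only; the node fields and helper names are assumptions, and the node-link header table and the mining of conditional pattern bases are omitted):

```python
from collections import Counter

class FPNode:
    def __init__(self, item, parent=None):
        self.item, self.count = item, 1
        self.parent = parent        # parent link (needed later for conditional pattern bases)
        self.children = {}          # item -> FPNode

def build_fp_tree(transactions, min_sup):
    # First scan: support counts; L = frequent items in descending support-count order.
    counts = Counter(i for t in transactions for i in t)
    L = sorted((i for i in counts if counts[i] >= min_sup),
               key=lambda i: (-counts[i], i))
    rank = {item: pos for pos, item in enumerate(L)}

    root = FPNode(None)             # the tree root, labeled "null"
    # Second scan: insert each transaction's frequent items in L order, sharing prefixes.
    for t in transactions:
        node = root
        for item in sorted((i for i in t if i in rank), key=rank.get):
            if item in node.children:
                node.children[item].count += 1      # shared prefix: increment the count
            else:
                node.children[item] = FPNode(item, parent=node)
            node = node.children[item]
    return root, L

D = [{"I1","I2","I5"}, {"I2","I4"}, {"I2","I3"}, {"I1","I2","I4"}, {"I1","I3"},
     {"I2","I3"}, {"I1","I3"}, {"I1","I2","I3","I5"}, {"I1","I2","I3"}]
root, L = build_fp_tree(D, min_sup=2)
print(L)                                                  # ['I2', 'I1', 'I3', 'I4', 'I5']
print({i: n.count for i, n in root.children.items()})     # {'I2': 7, 'I1': 2}
```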
5.Mining Frequent Itemsets Using the Vertical Data Format
Both the Apriori and FP-growth methods mine frequent patterns from a set of transactions
in TID-itemset format (i.e., {TID : itemset}), where TID is a transaction ID and itemset is the
set of items bought in transaction TID. This is known as the horizontal data format.
Alternatively, data can be presented in item-TID set format (i.e., {item : TID set}), where
item is an item name, and TID set is the set of transaction identifiers containing the item.
This is known as the vertical data format. Frequent itemsets can also be mined efficiently
using vertical data format, which is the essence of the Eclat (Equivalence Class
Transformation) algorithm.
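A minimal sketch of the vertical-format idea (illustrative, not the full Eclat algorithm): rewrite the data as item : TID-set pairs, then obtain the support count of a k-itemset by intersecting the TID sets of its items.

```python
from collections import defaultdict

# Horizontal data format: TID -> itemset (the same running example).
horizontal = {
    "T100": {"I1", "I2", "I5"}, "T200": {"I2", "I4"}, "T300": {"I2", "I3"},
    "T400": {"I1", "I2", "I4"}, "T500": {"I1", "I3"}, "T600": {"I2", "I3"},
    "T700": {"I1", "I3"}, "T800": {"I1", "I2", "I3", "I5"}, "T900": {"I1", "I2", "I3"},
}

# Convert to the vertical data format: item -> set of TIDs containing it.
vertical = defaultdict(set)
for tid, items in horizontal.items():
    for item in items:
        vertical[item].add(tid)

# The support count of a 2-itemset is the size of the intersection of the two TID sets.
tids = vertical["I1"] & vertical["I2"]
print(sorted(tids))   # ['T100', 'T400', 'T800', 'T900']  -> support count of {I1, I2} is 4
```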
# Pattern Evaluation Methods: From Association Mining to Correlation Analysis
A correlation measure can be used to augment the support–confidence framework for
association rules. This leads to correlation rules of the form:
A ⇒ B [support, confidence, correlation]
That is, a correlation rule is measured not only by its support and confidence but also by
the correlation between itemsets A and B. There are many different correlation measures
from which to choose.
1.Lift:
Lift is a simple correlation measure. The occurrence of itemset A is independent of the occurrence of itemset B if P(A ∪ B) = P(A)P(B); otherwise, itemsets A and B are dependent and correlated as events. This definition can easily be extended to more than two itemsets. The lift between the occurrence of A and B can be measured by computing
lift(A, B) = P(A ∪ B) / (P(A)P(B)).
If the resulting value is less than 1, then the occurrence of A is negatively correlated with the occurrence of B, meaning that the occurrence of one likely leads to the absence of the other. If the resulting value is greater than 1, then A and B are positively correlated, meaning that the occurrence of one implies the occurrence of the other. If the resulting value is equal to 1, then A and B are independent and there is no correlation between them.
Example
Suppose we are interested in analyzing transactions at AllElectronics with respect to the
purchase of computer games and videos. Let game refer to the transactions containing
computer games, and video refer to those containing videos. Of the 10,000 transactions
analyzed, the data show that 6000 of the customer transactions included computer games,
while 7500 included videos, and 4000 included both computer games and videos.
We need to study how the two itemsets, A and B, are correlated. Let (game)’ refer to the transactions that do not contain computer games, and (video)’ refer to those that do not contain videos. The transactions can be summarized in the following contingency table (expected values are shown in parentheses):

|          | game        | (game)'     | Total |
|----------|-------------|-------------|-------|
| video    | 4000 (4500) | 3500 (3000) | 7500  |
| (video)' | 2000 (1500) | 500 (1000)  | 2500  |
| Total    | 6000        | 4000        | 10000 |

From the table, we can see that the probability of purchasing a computer game is P({game}) = 0.60, the probability of purchasing a video is P({video}) = 0.75, and the probability of purchasing both is P({game, video}) = 0.40. The lift of the rule game ⇒ video is P({game, video})/(P({game}) × P({video})) = 0.40/(0.60 × 0.75) = 0.89. Because this value is less than 1, there is a negative correlation between the occurrence of {game} and {video}.
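As a quick check, a tiny sketch computing the lift from the probabilities above:

```python
# Probabilities taken from the 10,000-transaction example above.
p_game, p_video, p_both = 0.60, 0.75, 0.40

lift = p_both / (p_game * p_video)
print(round(lift, 2))   # 0.89 -> less than 1, so {game} and {video} are negatively correlated
```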
2.Chi-Square:
The second correlation measure that we study is the χ² measure.
A correlation relationship between two attributes, A and B, can be discovered by a χ² (chi-square) test. Suppose A has c distinct values, namely a1, a2, ..., ac, and B has r distinct values, namely b1, b2, ..., br. The data tuples described by A and B can be shown as a contingency table, with the c values of A making up the columns and the r values of B making up the rows. Let (Ai, Bj) denote the joint event that attribute A takes on value ai and attribute B takes on value bj, that is, (A = ai, B = bj). Each possible (Ai, Bj) joint event has its own cell (or slot) in the table. The χ² value (also known as the Pearson χ² statistic) is computed as
χ² = Σi Σj (oij − eij)² / eij,   where   eij = (count(A = ai) × count(B = bj)) / N,
oij is the observed frequency (actual count) of the joint event (Ai, Bj), eij is its expected frequency, N is the number of data tuples, count(A = ai) is the number of tuples having value ai for A, and count(B = bj) is the number of tuples having value bj for B. The sum is computed over all of the r × c cells. Note that the cells that contribute the most to the χ² value are those whose actual count is very different from that expected.
To compute the correlation using χ² analysis for nominal data, we need the observed value and expected value (displayed in parentheses) for each slot of the contingency table, as given in the table above. From the table, we can compute the χ² value as follows:
χ² = (4000 − 4500)²/4500 + (3500 − 3000)²/3000 + (2000 − 1500)²/1500 + (500 − 1000)²/1000
   = 55.56 + 83.33 + 166.67 + 250.00 = 555.6.
Because the χ² value is greater than 1, and the observed value of the slot (game, video) = 4000 is less than the expected value of 4500, buying game and buying video are negatively correlated.
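The same computation as a small sketch (observed counts from the contingency table above; expected counts derived from the row and column totals):

```python
# Observed counts: rows = (video, video'), columns = (game, game').
observed = [[4000, 3500],
            [2000,  500]]

N = sum(sum(row) for row in observed)                  # 10,000 transactions
row_totals = [sum(row) for row in observed]            # [7500, 2500]
col_totals = [sum(col) for col in zip(*observed)]      # [6000, 4000]

chi2 = 0.0
for i, row in enumerate(observed):
    for j, o in enumerate(row):
        e = row_totals[i] * col_totals[j] / N          # expected count for cell (i, j)
        chi2 += (o - e) ** 2 / e

print(round(chi2, 1))   # 555.6
```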
3.A Comparison of Pattern Evaluation Measures (all confidence, max confidence,
Kulczynski, and cosine):
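For reference, these four interestingness measures are commonly defined in terms of itemset support as follows:
all_confidence(A, B) = sup(A ∪ B) / max{sup(A), sup(B)}
max_confidence(A, B) = max{P(A | B), P(B | A)}
Kulczynski(A, B) = (1/2) (P(A | B) + P(B | A))
cosine(A, B) = P(A ∪ B) / √(P(A) P(B)) = sup(A ∪ B) / √(sup(A) sup(B))
Each of these measures ranges over [0, 1] and, unlike lift and χ², is null-invariant: its value is unaffected by transactions that contain neither A nor B.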