
Mining Frequent Patterns, Associations and Correlations: Basic Concepts and Methods
What Is Pattern Discovery?

• Patterns represent intrinsic and important properties of data sets
• Frequent patterns: sets of items, subsequences, or substructures that occur frequently together (or are strongly correlated) in a data set
• Motivation examples:
1. What products are often purchased together? E.g., the items milk and bread frequently appear together in a transaction
2. What are the typical subsequent purchases after buying an iPad?
3. What kinds of DNA are sensitive to this new drug?
4. What word sequences likely form phrases in this corpus?
Pattern Discovery: Why Is It Important?
• Finding inherent regularities in a data set
• Foundation for many essential data mining tasks:
  • Association, correlation, and causality analysis
  • Mining sequential and structural (e.g., sub-graph) patterns
  • Pattern analysis in spatiotemporal, multimedia, time-series, and stream data
  • Classification: discriminative pattern-based analysis
  • Cluster analysis: pattern-based subspace clustering
• Broad applications: market basket analysis, cross-marketing, catalog design, sale campaign analysis, Web log analysis, biological sequence analysis
Market Basket Analysis
• Frequent itemset mining leads to the discovery of associations and correlations among items in large transactional or relational data sets.
• Industries are interested in such patterns in their data.
• Helps in many business decision-making processes, such as:
  • developing marketing strategies
  • catalog design
  • cross-marketing
  • customer shopping behavior analysis
• Market basket analysis: the process analyzes customer buying habits by finding associations between the different items that customers place in their "shopping baskets"
Market Basket Analysis
• These discoveries help retailers gain insight into which items are frequently bought together
• Helps in designing store layouts: frequently bought items can be placed in close proximity
• Helps retailers plan which products can be put on sale at a reduced price
• Consider the universe as the set of items available; then:
  • Each item is represented by a Boolean variable indicating the presence or absence of that item
  • Each basket can be represented by a Boolean vector of values assigned to these variables
  • Analyzing the buying patterns reveals itemsets that are frequently purchased together
• Example rule: computer ⇒ antivirus_software [support = 2%, confidence = 60%]
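As a small illustration of this representation, here is a minimal sketch; the item universe and the baskets below are made up for the example, not taken from the text.

```python
# A minimal sketch of the Boolean-vector view of market baskets.
# The item universe and baskets are illustrative only.
ITEMS = ["bread", "milk", "computer", "antivirus_software"]

def to_boolean_vector(basket):
    """Represent a basket as a Boolean vector over the item universe."""
    return [item in basket for item in ITEMS]

baskets = [
    {"bread", "milk"},
    {"computer", "antivirus_software"},
    {"bread", "milk", "computer"},
]
print(to_boolean_vector(baskets[0]))  # [True, True, False, False]
```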
From Frequent Itemsets to Association Rules
• The patterns can be represented as association rules.
• Support and confidence are two measures of rule interestingness.
• Association rules are considered interesting if they satisfy both a minimum support threshold and a minimum confidence threshold.
• computer ⇒ software [support = 2%, confidence = 60%]
• A support of 2% means that 2% of all the transactions contain both items; a confidence of 60% means that 60% of the customers who purchased a computer also bought the software.
• Association rule: X ⇒ Y
• Support, s: the probability that a transaction contains X ∪ Y: support(X ⇒ Y) = P(X ∪ Y)
• Confidence, c: the conditional probability that a transaction containing X also contains Y: c(X ⇒ Y) = P(Y|X) = sup(X ∪ Y) / sup(X)
• Association rules are considered STRONG if they satisfy both a minimum support threshold and a minimum confidence threshold (set by users)
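A minimal sketch of these two measures, computed directly from the definitions above; the toy transactions are illustrative, not from the text.

```python
# Hedged sketch: support and confidence computed from the definitions.
def support(itemset, transactions):
    """Relative support: fraction of transactions containing itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(x, y, transactions):
    """c(X => Y) = sup(X U Y) / sup(X)."""
    return support(x | y, transactions) / support(x, transactions)

transactions = [{"computer", "software"}, {"computer"}, {"milk", "bread"}]
print(support({"computer", "software"}, transactions))      # ~0.33
print(confidence({"computer"}, {"software"}, transactions)) # 0.5
```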
Basic Concepts: Frequent Patterns and Association Rules

Itemset: I = {i1, …, ik}. D is a set of database transactions; each transaction is associated with an identifier, called a TID. Let A be a set of items. A transaction T is said to contain A if A ⊆ T. An association rule is an implication of the form A ⇒ B, where A ⊂ I, B ⊂ I, A ≠ ∅, B ≠ ∅, and A ∩ B = ∅.

Transaction-id   Items bought
10               A, B, D
20               A, C, D
30               A, D, E
40               B, E, F
50               B, C, D, E, F

• Support, s: the probability that a transaction contains A ∪ B
• Confidence, c: the conditional probability that a transaction containing A also contains B

Example: let sup_min = 50% and conf_min = 50%, and find all rules A ⇒ B with minimum support and confidence.
• Frequent patterns: {A:3, B:3, D:4, E:3, AD:3}
• Association rules: A ⇒ D (60%, 100%), D ⇒ A (60%, 75%)
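A brief sketch verifying the example: enumerating 1- and 2-itemsets over this TDB with an absolute minsup of 3 (50% of 5 transactions) recovers exactly the frequent patterns listed above.

```python
# Sketch verifying the example: recovers {A:3, B:3, D:4, E:3, AD:3}.
from itertools import combinations

tdb = {10: {"A","B","D"}, 20: {"A","C","D"}, 30: {"A","D","E"},
       40: {"B","E","F"}, 50: {"B","C","D","E","F"}}

def count(itemset):
    return sum(itemset <= t for t in tdb.values())

items = sorted(set().union(*tdb.values()))
for k in (1, 2):
    for c in combinations(items, k):
        if count(set(c)) >= 3:          # absolute minsup = 3
            print(set(c), count(set(c)))
```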
Basic Concepts: Frequent Itemsets (Patterns)

• Itemset: a set of one or more items
• k-itemset: X = {x1, …, xk}
• (Absolute) support (count) of X: the occurrence frequency of the itemset, i.e., the number of transactions that contain X
• (Relative) support, s: the probability that a transaction contains X
• An itemset X is frequent if the relative support of X is no less than a minsup threshold
• Confidence: c(X ⇒ Y) = P(Y|X) = sup(X ∪ Y) / sup(X) = support_count(X ∪ Y) / support_count(X)
• Example: let minsup = 50%. Frequent 1-itemsets: Beer: 3 (60%), Nuts: 3 (60%), Diaper: 4 (80%), Eggs: 3 (60%). Frequent 2-itemsets: {Beer, Diaper}: 3 (60%)
Association Rule Mining

• Association rule mining can be viewed as a two-step process (a sketch of step 2 follows the list):
  1. Find all frequent itemsets
  2. Generate strong association rules from the frequent itemsets
• Finding all frequent itemsets: each of these itemsets must occur at least as frequently as a predetermined minimum support count, min_sup
• Generating strong association rules from the frequent itemsets: these rules must satisfy both minimum support and minimum confidence
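A sketch of the second step, assuming support counts for all subsets are already available from step 1; the helper name rules_from_itemset is illustrative.

```python
# Sketch of step 2: generate strong rules X => (itemset - X) from one
# frequent itemset by splitting it and testing confidence.
from itertools import combinations

def rules_from_itemset(itemset, support_count, min_conf):
    """Yield (antecedent, consequent, confidence) for strong rules.

    support_count maps frozensets to absolute support counts; it is
    assumed to contain every nonempty subset of `itemset`.
    """
    itemset = frozenset(itemset)
    for r in range(1, len(itemset)):
        for antecedent in map(frozenset, combinations(itemset, r)):
            conf = support_count[itemset] / support_count[antecedent]
            if conf >= min_conf:
                yield antecedent, itemset - antecedent, conf

counts = {frozenset("AD"): 3, frozenset("A"): 3, frozenset("D"): 4}
for x, y, c in rules_from_itemset("AD", counts, 0.5):
    print(set(x), "=>", set(y), f"conf={c:.2f}")
```

Note that any rule generated from a frequent itemset automatically satisfies minimum support, so only the confidence test remains.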
Challenge: There Are Too Many Frequent Patterns!

• A long pattern contains a combinatorial number of shorter frequent sub-patterns.
• How many frequent itemsets does the following TDB1 contain, assuming (absolute) minsup = 1?
  TDB1: T1: {a1, a2, …, a100}
• 1-itemsets: {a1}, {a2}, …, {a100}: C(100, 1) = 100
• 2-itemsets: {a1, a2}, {a1, a3}, …: C(100, 2) = 4950
• …
• 100-itemset: {a1, a2, …, a100}: C(100, 100) = 1
• In total: 2^100 − 1 sub-patterns, too huge a set for any computer to compute or store! How to handle such a challenge?
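The count is easy to confirm: by the binomial theorem, the number of nonempty sub-itemsets is the sum of C(100, k) for k = 1 to 100, which equals 2^100 − 1. A one-line check:

```python
# One-line check of the combinatorial count above.
from math import comb

assert sum(comb(100, k) for k in range(1, 101)) == 2**100 - 1
print(2**100 - 1)  # 1267650600228229401496703205375
```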
Closed and Maximal

• Solution 1: Closed patterns: a pattern (itemset) X is closed in a data set D if X is frequent and there exists no proper super-itemset Y such that Y has the same support as X in D.
• Solution 2: Max-patterns: a pattern X is a max-pattern if X is frequent and there exists no (immediate) super-itemset Y such that X ⊂ Y and Y is frequent.
Expressing Patterns in Compressed Form: Closed Patterns and Max-Patterns

My data set (min_sup = 0.5, i.e., an itemset must appear in at least 3 of the 6 transactions):
T1: A, B, C, E    T2: A, C, D, E    T3: B, C, E
T4: A, C, D, E    T5: C, D, E       T6: A, D, E

Support counts:
{A} = 4; {B} = 2; {C} = 5; {D} = 4; {E} = 6
{A,B} = 1; {A,C} = 3; {A,D} = 3; {A,E} = 4; {B,C} = 2; {B,D} = 0; {B,E} = 2; {C,D} = 3; {C,E} = 5; {D,E} = 3
{A,B,C} = 1; {A,B,D} = 0; {A,B,E} = 1; {A,C,D} = 2; {A,C,E} = 3; {A,D,E} = 3; {B,C,D} = 0; {B,C,E} = 2; {C,D,E} = 3
{A,B,C,D} = 0; {A,B,C,E} = 1; {B,C,D,E} = 0
Closed and Maximal: Example

• {A} = 4; not closed ({A,E} has the same support) and not maximal
• {B} = 2; not closed ({B,C} and {B,E} have the same support) and not maximal; note {B} is not even frequent (2 < 3)
• {C} = 5; not closed ({C,E} has the same support) and not maximal
• {D} = 4; closed, but not maximal due to frequent supersets such as {A,D}, {C,D}, and {D,E}
• {E} = 6; closed, but not maximal due to frequent supersets such as {C,E} and {D,E}
• {A,C,E} = 3; closed and maximal frequent: no frequent proper superset exists (e.g., {A,B,C,E} = 1 and {A,C,D,E} = 2 are both infrequent)
• {C,D,E} = 3; closed and maximal frequent (e.g., {B,C,D,E} = 0)
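The classification above can be reproduced by brute force; this is a sketch for checking the example, not an efficient miner.

```python
# Sketch deriving the closed and maximal frequent itemsets of the
# six-transaction data set above by brute-force enumeration.
from itertools import combinations

db = [{"A","B","C","E"}, {"A","C","D","E"}, {"B","C","E"},
      {"A","C","D","E"}, {"C","D","E"}, {"A","D","E"}]
minsup = 3  # 0.5 * 6 transactions

def sup(x):
    return sum(x <= t for t in db)

items = sorted(set().union(*db))
frequent = {frozenset(c) for k in range(1, len(items) + 1)
            for c in combinations(items, k) if sup(set(c)) >= minsup}

# Closed: no proper superset with the same support (such a superset
# would itself be frequent, so searching within `frequent` suffices).
closed = {x for x in frequent
          if not any(x < y and sup(y) == sup(x) for y in frequent)}
# Maximal: no frequent proper superset at all.
maximal = {x for x in frequent if not any(x < y for y in frequent)}

print("closed: ", sorted(map(sorted, closed)))
print("maximal:", sorted(map(sorted, maximal)))
# maximal: [['A','C','E'], ['A','D','E'], ['C','D','E']]
```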
Apriori Pruning and Scalable Mining Methods

• Scalable mining methods: three major approaches
  – Level-wise, join-based approach: Apriori
  – Vertical data format approach
  – Frequent pattern projection and growth
Apriori Pruning and Scalable Mining Methods

• Apriori is a seminal algorithm that uses prior knowledge of frequent itemset properties.
• Apriori employs a level-wise search, where frequent k-itemsets are used to explore (k+1)-itemsets.
• The set of frequent 1-itemsets is found as follows:
  • Scan the database to accumulate the count for each item
  • Retain the items that satisfy minimum support; the resulting set is denoted L1
• L1 is used to find L2, the set of frequent 2-itemsets, which is used to find L3, and so on, until no more frequent k-itemsets can be found.
• To improve the efficiency of the level-wise generation, the Apriori property is used to reduce the search space (a compact sketch of the whole loop follows).
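A compact, hedged sketch of this level-wise loop; candidate generation is inlined here, and the join and prune steps are spelled out on the following slides. minsup is an absolute count.

```python
# Compact sketch of Apriori's level-wise search.
from collections import Counter
from itertools import combinations

def apriori(db, minsup):
    """Return the set of all frequent itemsets (as frozensets) in db."""
    db = [frozenset(t) for t in db]
    counts = Counter(frozenset([i]) for t in db for i in t)
    level = {x for x, c in counts.items() if c >= minsup}        # L1
    frequent, k = set(level), 1
    while level:
        k += 1
        # Candidate Ck: unions of Lk-1 members that form k-itemsets,
        # keeping only those whose (k-1)-subsets are all frequent.
        cands = {a | b for a in level for b in level if len(a | b) == k}
        cands = {c for c in cands
                 if all(frozenset(s) in level
                        for s in combinations(c, k - 1))}
        counts = Counter()
        for t in db:                       # one DB scan per level
            counts.update(c for c in cands if c <= t)
        level = {c for c, n in counts.items() if n >= minsup}    # Lk
        frequent |= level
    return frequent

tdb = [{"A","B","D"}, {"A","C","D"}, {"A","D","E"},
       {"B","E","F"}, {"B","C","D","E","F"}]
print(sorted(map(sorted, apriori(tdb, 3))))
# [['A'], ['A', 'D'], ['B'], ['D'], ['E']]
```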
Apriori Property

• All nonempty subsets of a frequent itemset must also be frequent. Equivalently: if any itemset is infrequent, its supersets should not be generated or tested!
• If an itemset I does not satisfy the minimum support threshold (P(I) < min_sup), then I is not frequent.
• If item A is added to I, then the result I ∪ A cannot occur more frequently than I; hence P(I ∪ A) < min_sup, and I ∪ A is not frequent either.
• This property is called antimonotonicity: if a set cannot pass a test, all of its supersets will fail the same test. Algorithms make use of the Apriori property in a two-step process consisting of join and prune actions.
Join Step (for k ≥ 2)

• To find Lk, generate a set of candidate k-itemsets by joining Lk−1 with itself. The set of candidates is denoted Ck.
• Let l1 and l2 be itemsets in Lk−1. Apriori assumes the items within each itemset are sorted in lexicographic order.
• The join Lk−1 ⋈ Lk−1 is performed; members of Lk−1 are joinable if their first (k−2) items are in common.
• Members l1 and l2 are joined if (l1[1] = l2[1]) ∧ (l1[2] = l2[2]) ∧ … ∧ (l1[k−2] = l2[k−2]) ∧ (l1[k−1] < l2[k−1]).
• The resulting itemset formed by joining l1 and l2 is {l1[1], l1[2], …, l1[k−2], l1[k−1], l2[k−1]}.
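A sketch of the join step, with itemsets kept as lexicographically sorted tuples; the function name apriori_join is illustrative.

```python
# Sketch of the join step: two sorted (k-1)-itemsets are joinable when
# their first k-2 items agree and the last item of l1 precedes l2's.
def apriori_join(prev_level):
    """Generate candidate k-itemsets (sorted tuples) from L(k-1)."""
    prev = sorted(prev_level)   # lexicographic order, as Apriori assumes
    candidates = []
    for i, l1 in enumerate(prev):
        for l2 in prev[i + 1:]:
            if l1[:-1] == l2[:-1] and l1[-1] < l2[-1]:
                candidates.append(l1 + (l2[-1],))
    return candidates

L2 = [("A","B"), ("A","C"), ("B","C"), ("B","D")]
print(apriori_join(L2))  # [('A', 'B', 'C'), ('B', 'C', 'D')]
```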
Prune Step

• Initially, scan the DB once to get the frequent 1-itemsets.
• Ck is a superset of Lk: members of Ck may or may not be frequent, but all frequent k-itemsets are included in Ck.
• A database scan is done to determine the count of each candidate in Ck.
• The Apriori property is applied: any candidate with a (k−1)-subset that is not frequent cannot itself be frequent, and is therefore removed from Ck.
• Subset testing can be done efficiently using a hash tree.
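A matching sketch of the prune step; the function names are illustrative, and a plain set stands in for the hash tree mentioned above.

```python
# Sketch of the prune step: drop candidates with an infrequent
# (k-1)-subset, since by the Apriori property they cannot be frequent.
from itertools import combinations

def has_infrequent_subset(candidate, prev_level):
    """True if some (k-1)-subset of the candidate is not in L(k-1)."""
    k = len(candidate)
    return any(s not in prev_level for s in combinations(candidate, k - 1))

def apriori_prune(candidates, prev_level):
    prev_level = set(prev_level)
    return [c for c in candidates if not has_infrequent_subset(c, prev_level)]

L2 = {("A","B"), ("A","C"), ("B","C"), ("B","D")}
C3 = [("A","B","C"), ("B","C","D")]
print(apriori_prune(C3, L2))  # [('A','B','C')]; ('C','D') is infrequent
```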
Frequent Itemset Generation
[Figure: step-by-step generation of candidate and frequent itemsets, not reproduced here]
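Since the original figure is not reproduced, the following hedged stand-in traces the generation over the five-transaction TDB used earlier, printing each candidate set Ck and frequent set Lk (absolute minsup = 3).

```python
# Sketch tracing level-wise generation: prints each Lk and Ck.
from itertools import combinations

db = [{"A","B","D"}, {"A","C","D"}, {"A","D","E"},
      {"B","E","F"}, {"B","C","D","E","F"}]
minsup = 3

def sup(c):
    return sum(set(c) <= t for t in db)

level = sorted((i,) for i in set().union(*db) if sup((i,)) >= minsup)
k = 1
while level:
    print(f"L{k}:", [(c, sup(c)) for c in level])
    prev, k = set(level), k + 1
    cands = [l1 + (l2[-1],) for i, l1 in enumerate(level)
             for l2 in level[i + 1:] if l1[:-1] == l2[:-1]]      # join
    cands = [c for c in cands
             if all(s in prev for s in combinations(c, k - 1))]  # prune
    if cands:
        print(f"C{k}:", cands)
    level = [c for c in cands if sup(c) >= minsup]
```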
