
MODULE 3

MINING FREQUENT PATTERNS


Frequent patterns are patterns (e.g., itemsets, subsequences, or substructures) that appear
frequently in a data set. For example, a set of items, such as milk and bread, that appear
frequently together in a transaction data set is a frequent itemset. A subsequence, such as buying
first a PC, then a digital camera, and then a memory card, if it occurs frequently in a shopping
history database, is a (frequent) sequential pattern. A substructure can refer to different structural
forms, such as subgraphs, subtrees, or sublattices, which may be combined with itemsets or
subsequences. If a substructure occurs frequently, it is called a (frequent) structured pattern.
Finding frequent patterns plays an essential role in mining associations, correlations, and many
other interesting relationships among data.
Market Basket Analysis: A Motivating Example
A typical example of frequent itemset mining is market basket analysis. This process analyzes
customer buying habits by finding associations between the different items that customers place
in their “shopping baskets” (Figure 6.1). The discovery of these associations can help retailers
develop marketing strategies by gaining insight into which items are frequently purchased
together by customers. For instance, if customers are buying milk, how likely are they to also
buy bread (and what kind of bread) on the same trip to the supermarket? This information can
lead to increased sales by helping retailers do selective marketing and plan their shelf space.
Rule support and confidence are two measures of rule interestingness. They respectively reflect
the usefulness and certainty of discovered rules. For example, for the rule computer ⇒ antivirus_software,
a support of 2% means that 2% of all the transactions under analysis show that computer and
antivirus software are purchased together. A confidence of 60% means that 60% of the customers
who purchased a computer also bought the software. Typically, association rules are considered
interesting if they satisfy both a minimum support threshold and a minimum confidence threshold.
These thresholds can be set by users or domain experts.
Frequent Itemsets, Closed Itemsets, and Association Rules
support(A ⇒ B) = P(A ∪ B)
confidence(A ⇒ B) = P(B|A) = support(A ∪ B) / support(A)
Rules that satisfy both a minimum support threshold (min sup) and a minimum confidence
threshold (min conf) are called strong. A set of items is referred to as an itemset. An itemset that
contains k items is a k-itemset. The set {computer, antivirus software} is a 2-itemset. The
occurrence frequency of an itemset is the number of transactions that contain the itemset. This is
also known, simply, as the frequency, support count, or count of the itemset.
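To make these measures concrete, here is a minimal Python sketch that computes the support and confidence of a rule A ⇒ B from a list of transactions; the transactions themselves are made-up values for illustration, not data from the text.

```python
# Minimal sketch: computing support and confidence of a rule A => B.
# The transactions below are made up for illustration.
transactions = [
    {"computer", "antivirus_software"},
    {"computer", "printer"},
    {"computer", "antivirus_software", "memory_card"},
    {"printer", "scanner"},
    {"computer"},
]

A = {"computer"}
B = {"antivirus_software"}

n = len(transactions)
count_A = sum(1 for t in transactions if A <= t)          # transactions containing A
count_AB = sum(1 for t in transactions if (A | B) <= t)   # transactions containing both A and B

support = count_AB / n            # support(A => B) = P(A ∪ B)
confidence = count_AB / count_A   # confidence(A => B) = P(B|A) = support(A ∪ B) / support(A)

print(f"support = {support:.2f}, confidence = {confidence:.2f}")
```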

In general, association rule mining can be viewed as a two-step process:


1. Find all frequent itemsets: By definition, each of these itemsets will occur at least as
frequently as a predetermined minimum support count, min sup.
2. Generate strong association rules from the frequent itemsets: By definition, these rules must
satisfy minimum support and minimum confidence.
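As an illustration of step 2, the following sketch derives strong rules from a small table of frequent itemsets and their support counts. The itemsets, counts, and thresholds are illustrative assumptions standing in for the output of step 1.

```python
from itertools import combinations

# A few hypothetical frequent itemsets with their support counts (step 1 output).
support_count = {
    frozenset(["I1"]): 6, frozenset(["I2"]): 7, frozenset(["I5"]): 2,
    frozenset(["I1", "I2"]): 4, frozenset(["I1", "I5"]): 2,
    frozenset(["I2", "I5"]): 2, frozenset(["I1", "I2", "I5"]): 2,
}
num_transactions = 9
min_sup, min_conf = 0.2, 0.6

# Step 2: for every frequent itemset l and every nonempty proper subset s,
# output the rule s => (l - s) if it meets both thresholds.
for l in support_count:
    if len(l) < 2:
        continue
    for size in range(1, len(l)):
        for s in map(frozenset, combinations(l, size)):
            sup = support_count[l] / num_transactions
            conf = support_count[l] / support_count[s]
            if sup >= min_sup and conf >= min_conf:
                print(f"{set(s)} => {set(l - s)}  (support={sup:.2f}, confidence={conf:.2f})")
```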

Frequent Itemset Mining Methods


Apriori Algorithm: Finding Frequent Itemsets by Confined Candidate Generation
Apriori is a seminal algorithm proposed by R. Agrawal and R. Srikant in 1994 for mining frequent
itemsets for Boolean association rules [AS94b]. The name of the algorithm is based on the fact that the
algorithm uses prior knowledge of frequent itemset properties, as we shall see later. Apriori employs an
iterative approach known as a level-wise search, where k-itemsets are used to explore (k+1)-itemsets.
First, the set of frequent 1-itemsets is found by scanning the database to accumulate the count for each
item, and collecting those items that satisfy minimum support. The resulting set is denoted by L1. Next,
L1 is used to find L2, the set of frequent 2-itemsets, which is used to find L3, and so on, until no more
frequent k-itemsets can be found. The finding of each Lk requires one full scan of the database. To
improve the efficiency of the level-wise generation of frequent itemsets, an important property called the
Apriori property is used to reduce the search space: all nonempty subsets of a frequent itemset must also
be frequent, so any candidate with an infrequent subset can be pruned without counting it.
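The join and prune steps that exploit this property might be coded as follows. This is a simplified sketch, assuming frequent itemsets are represented as sorted tuples, rather than the textbook pseudocode; the L2 used to exercise it is the set of frequent 2-itemsets from the AllElectronics example discussed below.

```python
from itertools import combinations

def apriori_gen(L_k):
    """Generate candidate (k+1)-itemsets from the frequent k-itemsets L_k (sorted tuples)."""
    L_k = set(L_k)
    k = len(next(iter(L_k)))
    candidates = set()
    # Join step: merge pairs of k-itemsets whose first k-1 items agree.
    for a in L_k:
        for b in L_k:
            if a[:k - 1] == b[:k - 1] and a[k - 1] < b[k - 1]:
                candidates.add(a + (b[k - 1],))
    # Prune step (Apriori property): drop candidates that have an infrequent k-subset.
    return {c for c in candidates
            if all(s in L_k for s in combinations(c, k))}

# Frequent 2-itemsets from the AllElectronics example below (assumed for illustration).
L2 = {("I1", "I2"), ("I1", "I3"), ("I1", "I5"),
      ("I2", "I3"), ("I2", "I4"), ("I2", "I5")}
print(sorted(apriori_gen(L2)))  # [('I1', 'I2', 'I3'), ('I1', 'I2', 'I5')]
```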

Let’s look at a concrete example, based on the AllElectronics transaction database, D. There are
nine transactions in this database, that is, |D| = 9. We use Figure 6.2 to illustrate the Apriori algorithm for
finding frequent itemsets in D.
1. In the first iteration of the algorithm, each item is a member of the set of candidate 1-itemsets, C1. The
algorithm simply scans all of the transactions to count the number of occurrences of each item.
2. Suppose that the minimum support count required is 2, that is, min sup = 2. (Here, we are referring to
absolute support because we are using a support count. The corresponding relative support is 2/9 = 22%.)
The set of frequent 1-itemsets, L1, can then be determined. It consists of the candidate 1-itemsets
satisfying minimum support. In our example, all of the candidates in C1 satisfy minimum support.
3. To discover the set of frequent 2-itemsets, L2, the algorithm uses the join L1 ⋈ L1 to generate a
candidate set of 2-itemsets, C2. C2 consists of |L1| choose 2 (i.e., |L1|(|L1| − 1)/2) 2-itemsets. Note that no
candidates are removed from C2 during the prune step because each subset of the candidates is also frequent.
4. Next, the transactions in D are scanned and the support count of each candidate itemset in C2 is
accumulated, as shown in the middle table of the second row in Figure.
5. The set of frequent 2-itemsets, L2, is then determined, consisting of those candidate 2-itemsets in C2
having minimum support.
6. The generation of the set of the candidate 3-itemsets, C3, is detailed in Figure. From the join step, we
first get C3 = L2 ⋈ L2 = {{I1, I2, I3}, {I1, I2, I5}, {I1, I3, I5}, {I2, I3, I4}, {I2, I3, I5}, {I2, I4, I5}}.
Based on the Apriori property, the last four of these candidates are pruned because each has a 2-item
subset that is not frequent, leaving C3 = {{I1, I2, I3}, {I1, I2, I5}}.
7. The transactions in D are scanned to determine L3, consisting of those candidate 3-itemsets in C3
having minimum support (Figure).
8. The algorithm uses L3 ⋈ L3 to generate a candidate set of 4-itemsets, C4. Although the join results in
{{I1, I2, I3, I5}}, itemset {I1, I2, I3, I5} is pruned because its subset {I2, I3, I5} is not frequent. Thus, C4
= ∅, and the algorithm terminates, having found all of the frequent itemsets.
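The level-wise search traced above can be assembled into a small driver loop. The sketch below reuses the apriori_gen helper sketched earlier; the nine transactions follow the standard textbook version of the AllElectronics table and should be treated as assumed data if your copy differs.

```python
def apriori(transactions, min_sup_count):
    """Return a dict mapping each frequent itemset (as a sorted tuple) to its support count."""
    # L1: frequent 1-itemsets from one scan of the database.
    counts = {}
    for t in transactions:
        for item in t:
            counts[(item,)] = counts.get((item,), 0) + 1
    frequent = {c: n for c, n in counts.items() if n >= min_sup_count}
    all_frequent = dict(frequent)

    # Level-wise search: use L_k to generate and count the candidates C_{k+1}.
    while frequent:
        candidates = apriori_gen(frequent.keys())   # join + prune (defined above)
        counts = {c: 0 for c in candidates}
        for t in transactions:                      # one full scan per level
            for c in candidates:
                if set(c) <= set(t):
                    counts[c] += 1
        frequent = {c: n for c, n in counts.items() if n >= min_sup_count}
        all_frequent.update(frequent)
    return all_frequent

# The nine AllElectronics transactions (assumed), mined with min_sup = 2.
D = [["I1", "I2", "I5"], ["I2", "I4"], ["I2", "I3"], ["I1", "I2", "I4"],
     ["I1", "I3"], ["I2", "I3"], ["I1", "I3"], ["I1", "I2", "I3", "I5"],
     ["I1", "I2", "I3"]]
print(apriori(D, min_sup_count=2))  # includes the frequent 3-itemsets {I1,I2,I3} and {I1,I2,I5}
```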
FP-Growth Algorithm
The FP-Growth algorithm is an alternative way to find frequent itemsets without candidate
generation, thus improving performance. To do so, it uses a divide-and-conquer
strategy. The core of this method is a special data structure named the frequent-pattern
tree (FP-tree), which retains the itemset association information.
Using this strategy, FP-Growth reduces the search cost by recursively looking for short
patterns and then concatenating them into longer frequent patterns.

In large databases, the FP-tree may not fit in main memory. A strategy to cope with
this problem is to partition the database into a set of smaller databases (called projected
databases) and then construct an FP-tree from each of these smaller databases.

FP-Tree

The frequent-pattern tree (FP-tree) is a compact data structure that stores quantitative
information about frequent patterns in a database. Each transaction is read and then mapped onto
a path in the FP-tree. This is done until all transactions have been read. Different transactions
with common subsets allow the tree to remain compact because their paths overlap.

A frequent-pattern tree is built from the initial itemsets of the database. The purpose of the FP-tree
is to mine the most frequent patterns. Each node of the FP-tree represents an item of an itemset.

The root node represents null, while the lower nodes represent the item sets. The associations of
the nodes with the lower nodes, that is, the item sets with the other item sets, are maintained
while forming the tree.

Han defines the FP-tree as the tree structure given below:

1. One root is labelled as "null" with a set of item-prefix subtrees as children and a frequent-
item-header table.
2. Each node in the item-prefix subtree consists of three fields:
o Item-name: registers which item is represented by the node;
o Count: the number of transactions represented by the portion of the path reaching
the node;
o Node-link: links to the next node in the FP-tree carrying the same item name or
null if there is none.
3. Each entry in the frequent-item-header table consists of two fields:
o Item-name: the same as in the corresponding node;
o Head of node-link: a pointer to the first node in the FP-tree carrying the item
name.
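A minimal Python sketch of this node structure is given below; the class and field names are illustrative choices, not part of Han's definition.

```python
class FPNode:
    """One node of an FP-tree: item-name, count, parent/children links, and node-link."""
    def __init__(self, item, parent=None):
        self.item = item          # item-name ("null" for the root)
        self.count = 0            # transactions represented by the path down to this node
        self.parent = parent      # parent node (None for the root)
        self.children = {}        # item-name -> child FPNode
        self.node_link = None     # next node in the tree carrying the same item-name, or None

# Frequent-item-header table: item-name -> head of that item's node-link chain
# (it can also record the item's support count).
header_table = {}
```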

Additionally, the frequent-item-header table can store the support count for each item. The best-case
scenario occurs when all transactions have the same itemset: the FP-tree is then only a single branch
of nodes.

The worst-case scenario occurs when every transaction has a unique itemset. In that case the space
needed to store the tree is greater than the space used to store the original data set, because the
FP-tree requires additional space for the pointers between nodes and the counters for each item.
The tree's complexity grows with the uniqueness of the transactions.
Using this algorithm, the FP-tree is constructed in two database scans. The first scan collects and
sorts the set of frequent items, and the second constructs the FP-Tree.

Example

Support threshold=50%, Confidence= 60%

Table 1:

Transaction   List of items
T1            I1, I2, I3
T2            I2, I3, I4
T3            I4, I5
T4            I1, I2, I4
T5            I1, I2, I3, I5
T6            I1, I2, I3, I4

Solution: Support threshold=50% => 0.5*6= 3 => min_sup=3

Table 2: Count of each item

Item   Count
I1     4
I2     5
I3     4
I4     4
I5     2
Table 3: Frequent items sorted in descending order of count (I5 is dropped because its count, 2, is
below min_sup = 3).

Item   Count
I2     5
I1     4
I3     4
I4     4

Build FP Tree

Let's build the FP-tree in the following steps:

1. Consider the root node, labelled null.
2. The first transaction, T1: I1, I2, I3, is inserted in sorted order as I2, I1, I3: I2 is linked as a child
of the root, I1 is linked to I2, and I3 is linked to I1, giving {I2:1}, {I1:1}, {I3:1}.
3. T2: I2, I3, I4 is inserted in sorted order as I2, I3, I4: I2 is linked to the root, I3 is linked to I2, and
I4 is linked to I3. This branch shares the I2 node with T1, since I2 is already a child of the root.
4. So the count of I2 is incremented by 1, I3 is linked as a new child of I2, and I4 is linked as a child
of that I3. The counts are {I2:2}, {I3:1}, {I4:1}.
5. T3: I4, I5. Since I5 is not frequent (its count, 2, is below min_sup = 3), only I4 is inserted, creating
a new branch with I4 linked directly to the root: {I4:1}.
6. T4: I1, I2, I4. The sorted sequence is I2, I1, I4. I2 is already linked to the root, so its count is
incremented by 1; similarly, I1 is incremented by 1 because it is already linked to I2 from T1, and
I4 is added as a new child of I1. Thus {I2:3}, {I1:2}, {I4:1}.
7. T5: I1, I2, I3, I5. With the infrequent item I5 dropped, the sorted sequence is I2, I1, I3. Thus
{I2:4}, {I1:3}, {I3:2}.
8. T6: I1, I2, I3, I4. The sorted sequence is I2, I1, I3, I4. Thus {I2:5}, {I1:4}, {I3:3}, {I4:1}.
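Putting these steps together, the sketch below builds the FP-tree for the six transactions of Table 1, reusing the FPNode class sketched earlier. Node-links and the header table are omitted for brevity, and the infrequent item I5 is dropped before insertion, as in the steps above.

```python
from collections import Counter

# The six transactions of Table 1, with min_sup = 3 as computed above.
transactions = [["I1", "I2", "I3"], ["I2", "I3", "I4"], ["I4", "I5"],
                ["I1", "I2", "I4"], ["I1", "I2", "I3", "I5"], ["I1", "I2", "I3", "I4"]]
min_sup = 3

# First scan: count items, keep the frequent ones, and rank them by descending count.
counts = Counter(item for t in transactions for item in t)
frequent = {item: c for item, c in counts.items() if c >= min_sup}
rank = {item: i for i, item in enumerate(sorted(frequent, key=lambda x: (-frequent[x], x)))}

# Second scan: insert each transaction's frequent items, in rank order, into the tree.
root = FPNode("null")
for t in transactions:
    node = root
    for item in sorted((i for i in t if i in frequent), key=lambda x: rank[x]):
        if item not in node.children:
            node.children[item] = FPNode(item, parent=node)
        node = node.children[item]
        node.count += 1

def show(node, depth=0):
    """Print each branch of the tree, one node per line, indented by depth."""
    for child in node.children.values():
        print("  " * depth + f"{child.item}:{child.count}")
        show(child, depth + 1)

show(root)  # main branch I2:5, I1:4, I3:3, I4:1; side branches I1-I4:1, I2-I3:1-I4:1, and I4:1
```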
MINING ASSOCIATION RULES

Association rule learning is an unsupervised learning technique that checks for the
dependency of one data item on another and maps the items accordingly so that the result can be
exploited profitably. It tries to find interesting relations or associations among the variables of a
dataset, expressed as rules that describe how the occurrence of one set of items relates to another.

Association rule learning is one of the important concepts of machine learning, and it is
employed in market basket analysis, web usage mining, continuous production, and so on. Market
basket analysis is a technique used by large retailers to discover associations between items. We can
understand it with the example of a supermarket, where products that are frequently purchased
together are placed together.
Types of association rules in data mining

There are multiple types of association rules in data mining. They include the following:

- Generalized. Rules in this category are general examples of association rules that provide a
high-level overview of what these associations of data points look like.

- Multilevel. Multilevel association rules separate data points into different levels of importance,
known as levels of abstraction, and distinguish between associations of more important data
points and ones of lower importance.

- Quantitative. This type of association rule describes associations made between numerical
data points.

- Multirelational. This type goes beyond traditional association rules, which consider
relationships between single data points; multirelational rules are mined across multiple
or multidimensional databases.
