BIDE : Efficient Mining of Frequent Closed Sequences
Jianyong Wang and Jiawei Han. Proc. 2004 Int. Conf. on Data Engineering (ICDE'04), Boston, MA, March 2004.
Opening
[Figure: a 3-item pattern and a 4-item pattern, each with sup = 5; the item symbols were lost in extraction.]
Review of Algorithms
The sequential pattern mining problem was first introduced by Agrawal and Srikant (1995).
Most existing algorithms mine frequent sequences (closed and non-closed).
Frequent sequences:
Apriori-based (Apriori, AprioriAll, GSP)
Sequence-enumeration tree/lattice-based (Max-Miner, SPAM, SPADE)
Constraint-based (SPIRIT family)
Frequent closed sequences:
CloSpan: the first algorithm for mining frequent closed sequences (based on PrefixSpan)
BIDE
Apriori-based algorithms
Make multiple passes over the database.
First pass:
Find the frequent 1-sequences.
Join them to build the candidate 2-sequences.
At step k:
Join the frequent (k-1)-sequences to build the candidate k-sequences.
Scan the SDB to check which candidate k-sequences are actually frequent.
Stop at step m:
No frequent m-sequence is found.
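The level-wise loop above can be sketched in a few lines. This is a minimal illustration, not the GSP implementation: it assumes every event is a single item, stores each sequence as a Python string, and uses hypothetical helper names (`is_subsequence`, `support`, `apriori_sequences`). The join rule is an assumed GSP-style join. The sample data is the four-sequence example database from these slides, with min_sup = 50% read as an absolute support of 2.

```python
def is_subsequence(pattern, seq):
    """True if the items of pattern occur in seq in the same order."""
    it = iter(seq)
    return all(item in it for item in pattern)


def support(pattern, sdb):
    """Absolute support: number of database sequences containing pattern."""
    return sum(is_subsequence(pattern, s) for s in sdb)


def apriori_sequences(sdb, min_sup):
    """Level-wise mining: join frequent (k-1)-sequences, then scan the SDB."""
    items = sorted(set("".join(sdb)))
    level = [i for i in items if support(i, sdb) >= min_sup]  # first pass
    frequent = {}
    while level:  # stops once no frequent sequence of the current length exists
        frequent.update((p, support(p, sdb)) for p in level)
        # Join step (assumed GSP-style): a and b join when a without its
        # first item equals b without its last item.
        candidates = {a + b[-1] for a in level for b in level if a[1:] == b[:-1]}
        level = sorted(c for c in candidates if support(c, sdb) >= min_sup)
    return frequent


sdb = ["CAABC", "ABCB", "CABC", "ABBCA"]
frequent = apriori_sequences(sdb, min_sup=2)
print(len(frequent), "frequent sequences")
```

Every level costs one full pass over the database per candidate, which is the overhead the later pattern-growth methods try to avoid.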
Tree-based algorithms (1/2)
Use a sequence enumeration tree to generate all the candidate sequences. Traverse the tree (DFS). Use pruning techniques to avoid traversing sub-trees.
Apriori principle: if a sequence is infrequent, none of its super-sequences can be frequent.
If an extension event leads to an infrequent sequence, this event is no longer used in other sequence extensions.
Tree-based algorithms (2/2)
I = {a, b}
S-step: add the item to the sequence as a new event.
I-step: add the item to the last event of the sequence.
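The two extension steps can be illustrated with a small sketch. The representation (a sequence as a tuple of events, each event a tuple of sorted items) and the function name `children` are assumptions made for this example. The rule that an I-step only adds items larger than the last item of the event is also an assumption, used here so that no event is generated twice.

```python
def children(seq, items):
    """Return the S-step and I-step extensions of seq.

    seq is a tuple of events; each event is a tuple of items in sorted order.
    """
    s_steps = [seq + ((x,),) for x in items]  # S-step: x starts a new event
    last = seq[-1]
    # I-step: x joins the last event. Only items greater than the last item
    # are used (an assumed ordering rule) so no event is generated twice.
    i_steps = [seq[:-1] + (last + (x,),) for x in items if x > last[-1]]
    return s_steps + i_steps


root = (("a",),)
for child in children(root, ["a", "b"]):
    print(child)
```

For the root sequence containing the single event (a), this yields two S-step children and one I-step child.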
Definition of Terms (1/2)
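Item set: $I = \{i_1, i_2, \ldots, i_n\}$
Ordered event set (sequence): $S = \{e_1, e_2, \ldots, e_m\}$
Length: a sequence with $m$ events is an $m$-sequence
Subsequence: $S_a = \{a_1, a_2, \ldots, a_n\}$
Super-sequence: $S_b = \{b_1, b_2, \ldots, b_m\}$
$S_a$ is a subsequence of $S_b$ (and $S_b$ is a super-sequence of $S_a$) if there exist integers $1 \le i_1 < i_2 < \cdots < i_n \le m$ such that $a_1 = b_{i_1}, a_2 = b_{i_2}, \ldots, a_n = b_{i_n}$.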
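The subsequence condition can be checked by searching for the indices directly. The sketch below assumes single-item events stored as characters of a string; the function name `embedding` is hypothetical. It returns the leftmost 1-based indices that witness the condition, or None when no such indices exist.

```python
def embedding(sa, sb):
    """Return 1-based indices i_1 < ... < i_n with sa[k] == sb[i_k], or None."""
    indices, start = [], 0
    for item in sa:
        pos = sb.find(item, start)  # leftmost match at or after start
        if pos < 0:
            return None  # sa is not a subsequence of sb
        indices.append(pos + 1)
        start = pos + 1
    return indices


print(embedding("AB", "CAABC"))
print(embedding("BA", "CAABC"))
```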
Definition of Terms (2/2)
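SDB: sequence database, a set of tuples (sid, S), where sid is a sequence identifier and S is a sequence.
A tuple (sid, S) contains a sequence $S_a$ if $S_a$ is a subsequence of S.
$sup^{SDB}(S_a)$: absolute support, the number of tuples in SDB that contain $S_a$.
$sup^{SDB}(S_a) / |SDB|$: relative support.
min_sup: $S_a$ is frequent if $sup^{SDB}(S_a) \ge min\_sup$.
A frequent sequence S is closed if there exists no proper super-sequence of S with the same support.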
Frequent Closed Sequences
Find the complete set of frequent closed sequences :
SDB
I = { A, B, C } min_sup = 50%
Previous algorithms: 17 frequent sequences
BIDE (and CloSpan): 6 frequent closed sequences
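The two counts can be reproduced by brute force on the example database from these slides (CAABC, ABCB, CABC, ABBCA), reading min_sup = 50% as an absolute support of 2. The sketch below is only a check of the definitions, not BIDE or CloSpan; it assumes single-item events and uses hypothetical helper names.

```python
def is_subsequence(pattern, seq):
    """True if the items of pattern occur in seq in the same order."""
    it = iter(seq)
    return all(item in it for item in pattern)


def support(pattern, sdb):
    """Absolute support of pattern in sdb."""
    return sum(is_subsequence(pattern, s) for s in sdb)


def frequent_sequences(sdb, min_sup):
    """Enumerate frequent sequences depth-first by appending one item at a time."""
    items = sorted(set("".join(sdb)))
    frequent, stack = {}, list(items)
    while stack:
        pattern = stack.pop()
        sup = support(pattern, sdb)
        if sup >= min_sup:  # infrequent patterns are never extended
            frequent[pattern] = sup
            stack.extend(pattern + x for x in items)
    return frequent


def closed_sequences(frequent):
    """Keep patterns that have no longer super-sequence with equal support."""
    return {p: s for p, s in frequent.items()
            if not any(len(q) > len(p) and t == s and is_subsequence(p, q)
                       for q, t in frequent.items())}


sdb = ["CAABC", "ABCB", "CABC", "ABBCA"]
frequent = frequent_sequences(sdb, min_sup=2)
closed = closed_sequences(frequent)
print(len(frequent), len(closed))
print(sorted(closed.items()))
```

Comparing only against frequent super-sequences is sufficient, because a super-sequence with the same support as a frequent sequence is itself frequent.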
CloSpan
Maintains the set of already mined frequent closed patterns in memory.
Sub-pattern checking: a new pattern can be absorbed by an already mined frequent pattern.
Super-pattern checking: a new pattern can absorb some already mined frequent patterns.
To save space, CloSpan stores a superset of the frequent closed patterns in a hash-indexed tree structure and then prunes the tree to get the actual set of frequent closed patterns.
BIDE (1/2)
BIDE: an efficient algorithm for discovering the complete set of frequent closed sequences.
BI-Directional Extension: a new paradigm for mining frequent closed sequences without candidate maintenance.
Back-Scan pruning method.
Scan-Skip optimization technique.
Performance study: BIDE can be over an order of magnitude faster than the previous algorithms and consumes orders of magnitude less memory.
BIDE (2/2)
How to enumerate the complete set of frequent sequences?
Upon getting a frequent sequence, how to check if it is closed?
How to design some search space pruning methods or other optimization techniques to accelerate the mining process?
First instance
S = CAABC, Sp = AB
First instance of prefix sequence Sp in S: CAAB
Projected sequence of prefix sequence Sp in S: C
Projected database of prefix sequence Sp: Sp_SDB, the projected sequences of Sp over all sequences in SDB.
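A short sketch of these three notions, assuming single-item events stored as strings and a database stored as a dict from sequence identifier to sequence. The names `first_instance_end` and `projected_database` are hypothetical.

```python
def first_instance_end(seq, prefix):
    """Index just past the first instance of prefix in seq (None if absent)."""
    pos = 0
    for item in prefix:
        pos = seq.find(item, pos)  # leftmost match of the next prefix item
        if pos < 0:
            return None
        pos += 1
    return pos


def projected_database(sdb, prefix):
    """Map sid -> projected sequence for every sequence that contains prefix."""
    projected = {}
    for sid, seq in sdb.items():
        end = first_instance_end(seq, prefix)
        if end is not None:
            projected[sid] = seq[end:]  # what remains after the first instance
    return projected


sdb = {1: "CAABC", 2: "ABCB", 3: "CABC", 4: "ABBCA"}
end = first_instance_end(sdb[1], "AB")
print(sdb[1][:end], sdb[1][end:])
print(projected_database(sdb, "AB"))
```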
i-th last-in-first appearance
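The i-th last-in-first appearance w.r.t. Sp = e1e2...en is the last appearance of ei in the first instance of Sp that lies before the (i+1)-th last-in-first appearance.
S = CAABC, Sp = CA
First instance of prefix sequence Sp in S: CA
2nd last-in-first appearance: the last A of the first instance (the first A in S)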
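i-th semi-maximum period: SMP_i(Sp)
The piece of S between the end of the first instance of prefix e1e2...e(i-1) and the i-th last-in-first appearance; for i = 1, the piece of S before the 1st last-in-first appearance.
S = ABCB, Sp = AC
End of the first instance of prefix e1 = A: the A
2nd last-in-first appearance: the C
SMP_2(AC) = B
SMP_1(AC) = ∅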
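The semi-maximum periods can be computed by matching the prefix backwards inside its first instance. This is a sketch under the assumption of single-item events stored as strings; all function names are hypothetical, and positions are 0-based.

```python
def first_instance_end(seq, prefix):
    """Index just past the first instance of prefix in seq (None if absent)."""
    pos = 0
    for item in prefix:
        pos = seq.find(item, pos)
        if pos < 0:
            return None
        pos += 1
    return pos


def last_in_first_positions(seq, prefix):
    """0-based positions of the 1st..n-th last-in-first appearances."""
    bound = first_instance_end(seq, prefix)
    positions = []
    for item in reversed(prefix):
        bound = seq.rfind(item, 0, bound)  # last occurrence before bound
        positions.append(bound)
    return positions[::-1]


def semi_maximum_periods(seq, prefix):
    """[SMP_1, ..., SMP_n] of prefix in seq; seq must contain prefix."""
    lf = last_in_first_positions(seq, prefix)
    return [seq[first_instance_end(seq, prefix[:i]):lf[i]]
            for i in range(len(prefix))]


print(semi_maximum_periods("ABCB", "AC"))
print(last_in_first_positions("CAABC", "CA"))
```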
Back-Scan search space pruning method
Theorem: Let the prefix sequence be an n-sequence, $S_p = e_1 e_2 \ldots e_n$. If $\exists i$ ($1 \le i \le n$) and there exists an item $e$ which appears in each of the $i$-th semi-maximum periods of the prefix $S_p$ in SDB, we can safely stop growing prefix $S_p$.
I = { A, B, C } min_sup = 50%
Sp = B : 4
SMP1 = {CAA, A, CA, A}
Item A appears in every 1st semi-maximum period, so prune the sub-tree under B.
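The Back-Scan test from the theorem can be sketched as an intersection over semi-maximum periods. Single-item events stored as strings are assumed, the function names are hypothetical, and the database is the four-sequence example from these slides.

```python
def first_instance_end(seq, prefix):
    """Index just past the first instance of prefix in seq (None if absent)."""
    pos = 0
    for item in prefix:
        pos = seq.find(item, pos)
        if pos < 0:
            return None
        pos += 1
    return pos


def semi_maximum_period(seq, prefix, i):
    """The (i+1)-th semi-maximum period of prefix in seq (i is 0-based)."""
    bound = first_instance_end(seq, prefix)
    for item in reversed(prefix[i:]):  # walk back to the last-in-first e_i
        bound = seq.rfind(item, 0, bound)
    return seq[first_instance_end(seq, prefix[:i]):bound]


def backscan_prunable(sdb, prefix):
    """True if some item appears in every i-th semi-maximum period for some i."""
    seqs = [s for s in sdb if first_instance_end(s, prefix) is not None]
    if not seqs:
        return False
    for i in range(len(prefix)):
        periods = [set(semi_maximum_period(s, prefix, i)) for s in seqs]
        if set.intersection(*periods):
            return True
    return False


sdb = ["CAABC", "ABCB", "CABC", "ABBCA"]
print([semi_maximum_period(s, "B", 0) for s in sdb])
print(backscan_prunable(sdb, "B"), backscan_prunable(sdb, "A"))
```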
BI-Directional Extension closure checking
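Forward extension: $S_p = e_1 e_2 \ldots e_n$, $S_p^{*} = e_1 e_2 \ldots e_n\, e$. If $sup^{SDB}(S_p) = sup^{SDB}(S_p^{*})$, then $S_p$ is non-closed and item $e$ is a forward-extension event.
Backward extension: $S_p = e_1 e_2 \ldots e_n$, and $S_{p2} = e_1 e_2 \ldots e_i\, e\, e_{i+1} \ldots e_n$ or $S_{p2} = e\, e_1 e_2 \ldots e_n$. If $sup^{SDB}(S_p) = sup^{SDB}(S_{p2})$, then $S_p$ is non-closed and item $e$ is a backward-extension event.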
Theorem : If there exists no forward extension event nor backward extension event w.r.t. Sp, Sp must be closed; otherwise it must be non-closed.
Last instance
S = CABDCABBA, Sp = AB
Last instance of prefix sequence Sp in S: CABDCABB
i-th last-in-last appearance
S = CAABC, Sp = AB
Last instance of prefix sequence Sp in S: CAAB
1st last-in-last appearance: the last A of the last instance
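i-th maximum period: MP_i(Sp)
The piece of S between the end of the first instance of prefix e1e2...e(i-1) and the i-th last-in-last appearance; for i = 1, the piece of S before the 1st last-in-last appearance.
S = ABCB, Sp = AB
End of the first instance of prefix e1 = A: the A
2nd last-in-last appearance: the last B
MP_2(AB) = BC
MP_1(AB) = ∅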
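The last instance, the last-in-last appearances, and the maximum periods can be sketched the same way as their first-instance counterparts, except that the backward matching starts from the end of the whole sequence. Single-item events stored as strings are assumed, function names are hypothetical, and positions are 0-based.

```python
def first_instance_end(seq, prefix):
    """Index just past the first instance of prefix in seq (None if absent)."""
    pos = 0
    for item in prefix:
        pos = seq.find(item, pos)
        if pos < 0:
            return None
        pos += 1
    return pos


def last_in_last_positions(seq, prefix):
    """0-based positions of the 1st..n-th last-in-last appearances."""
    bound, positions = len(seq), []
    for item in reversed(prefix):
        bound = seq.rfind(item, 0, bound)  # last occurrence before bound
        positions.append(bound)
    return positions[::-1]


def last_instance(seq, prefix):
    """Piece of seq up to the end of the last appearance of prefix."""
    return seq[:last_in_last_positions(seq, prefix)[-1] + 1]


def maximum_periods(seq, prefix):
    """[MP_1, ..., MP_n] of prefix in seq; seq must contain prefix."""
    ll = last_in_last_positions(seq, prefix)
    return [seq[first_instance_end(seq, prefix[:i]):ll[i]]
            for i in range(len(prefix))]


print(last_instance("CABDCABBA", "AB"))
print(last_in_last_positions("CAABC", "AB"))
print(maximum_periods("ABCB", "AB"))
```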
Backward-Extension event checking
Lemma: If there exists an item e which appears in each of the i-th maximum periods of Sp in SDB (for some i), then e is a backward-extension event w.r.t. Sp.
SDB, I = {A, B, C}, min_sup = 50%
Sp = AC : 4
MP2 = {AB, B, B, BB}
The backward-extension event for Sp = AC is B, so AC:4 is a frequent non-closed sequence.
Sp = ABC : 4
No backward-extension item for Sp.
No forward-extension item for Sp.
ABC:4 is a frequent closed sequence.
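The lemma translates into an intersection over maximum periods. The sketch below assumes single-item events stored as strings, uses hypothetical function names, and runs on the four-sequence example database from these slides.

```python
def first_instance_end(seq, prefix):
    """Index just past the first instance of prefix in seq (None if absent)."""
    pos = 0
    for item in prefix:
        pos = seq.find(item, pos)
        if pos < 0:
            return None
        pos += 1
    return pos


def maximum_periods(seq, prefix):
    """[MP_1, ..., MP_n] of prefix in seq; seq must contain prefix."""
    bound, ll = len(seq), []
    for item in reversed(prefix):
        bound = seq.rfind(item, 0, bound)  # last-in-last appearance
        ll.append(bound)
    ll.reverse()
    return [seq[first_instance_end(seq, prefix[:i]):ll[i]]
            for i in range(len(prefix))]


def backward_extension_items(sdb, prefix):
    """Items that appear in every i-th maximum period for at least one i."""
    seqs = [s for s in sdb if first_instance_end(s, prefix) is not None]
    events = set()
    if not seqs:
        return events
    for i in range(len(prefix)):
        periods = [set(maximum_periods(s, prefix)[i]) for s in seqs]
        events |= set.intersection(*periods)
    return events


sdb = ["CAABC", "ABCB", "CABC", "ABBCA"]
print([maximum_periods(s, "AC")[1] for s in sdb])
print(backward_extension_items(sdb, "AC"))
print(backward_extension_items(sdb, "ABC"))
```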
Scan-Skip optimization
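SDB, I = {A, B, C}, min_sup = 50%, Sp = ABC with support 4.
MP1 = {CA, ∅, C, ∅}: the 2nd period is empty, so skip the last two.
MP2 = {A, ∅, ∅, B}: the 2nd period is empty, so skip the last two.
MP3 = {∅, ∅, ∅, B}: the 1st period is empty, so skip the last three.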
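One plausible reading of the skipping in this example is an early exit: as soon as the set of items common to the periods scanned so far becomes empty, no item can appear in every period, so the remaining periods need not be scanned. The sketch below illustrates that reading only; the function name `common_items_with_skip` is hypothetical, and the slides do not spell out the exact procedure.

```python
def common_items_with_skip(periods):
    """Intersect the item sets of the periods, stopping once nothing is common.

    Returns (common_items, number_of_periods_skipped).
    """
    common, scanned = None, 0
    for period in periods:
        scanned += 1
        items = set(period)
        common = items if common is None else common & items
        if not common:
            break  # no item can be in every period: skip the rest
    return (common or set()), len(periods) - scanned


for name, periods in [("MP1", ["CA", "", "C", ""]),
                      ("MP2", ["A", "", "", "B"]),
                      ("MP3", ["", "", "", "B"])]:
    common, skipped = common_items_with_skip(periods)
    print(name, sorted(common), "skipped", skipped)
```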
Pruning and optimization
Projected Sequence & Projected Database
SDB
Sequence identifier    Sequence
1                      CAABC
2                      ABCB
3                      CABC
4                      ABBCA
Projected database for Sp = AB: Sp_SDB = {C, CB, C, BCA}
Sequence identifier    Projected sequence
1                      C
2                      CB
3                      C
4                      BCA
Frequent Sequence Enumeration (Prefix-Span)
Forward-extension event checking
Lemma: For a prefix sequence Sp, its complete set of forward-extension events is equivalent to the set of its locally frequent items whose supports are equal to supSDB(Sp).
SDB, I = {A, B, C}, min_sup = 50%
Sp1 = AB : 4
Sp1_SDB = {C, CB, C, BCA}
The forward-extension event for AB is C.
Sp2 = CAB : 2
Sp2_SDB = {C, -, C, -} (sequences 2 and 4 do not contain CAB)
The forward-extension event for CAB is C.
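The lemma reduces forward-extension checking to counting items in the projected database. A minimal sketch, assuming single-item events stored as strings and using hypothetical function names:

```python
from collections import Counter


def first_instance_end(seq, prefix):
    """Index just past the first instance of prefix in seq (None if absent)."""
    pos = 0
    for item in prefix:
        pos = seq.find(item, pos)
        if pos < 0:
            return None
        pos += 1
    return pos


def forward_extension_items(sdb, prefix):
    """Local items whose support in the projected database equals sup(prefix)."""
    ends = [(s, first_instance_end(s, prefix)) for s in sdb]
    projected = [s[end:] for s, end in ends if end is not None]
    counts = Counter()
    for p in projected:
        counts.update(set(p))  # count each item once per projected sequence
    return {item for item, c in counts.items() if c == len(projected)}


sdb = ["CAABC", "ABCB", "CABC", "ABBCA"]
print(forward_extension_items(sdb, "AB"))
print(forward_extension_items(sdb, "CAB"))
print(forward_extension_items(sdb, "ABC"))
```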
Scheme of BIDE Algorithm
Input: SDB, min_sup
Sp: start from the frequent 1-sequences.
Grow Sp with an extension event.
BackScan pruning: if Sp can be pruned, stop growing Sp.
BI-Directional Extension closure checking: backward-extension event checking and forward-extension event checking.
An extension event is found: Sp is a frequent non-closed sequence.
No extension event: Sp is a frequent closed sequence.
The BIDE algorithm (1/2)
Scan the database once to find all the frequent 1-sequences.
For each frequent 1-sequence, build a pseudo-projected database and check whether it can be pruned (Back-Scan).
For every sequence that cannot be pruned, compute the number of backward-extension items and then call subroutine bide.
The BIDE algorithm (2/2)
Subroutine bide:
For prefix Sp, scan its projected DB.
Compute the number of forward-extension items.
If there is neither a forward-extension item nor a backward-extension item, Sp is closed.
Grow Sp with each locally frequent item to get a new prefix.
Build the pseudo-projected DB for the new prefix.
For each new prefix:
Check whether it can be pruned (Back-Scan).
If not, compute the number of backward-extension items and call bide recursively.
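The steps above can be put together in a compact sketch. It is a simplified illustration rather than the authors' implementation: it assumes single-item events stored as strings, recomputes first instances instead of keeping pseudo-projected databases, and uses hypothetical names (`first_end`, `has_common_item`, `bide`). The backward-extension check is done at the start of each recursive call instead of before it.

```python
from collections import Counter


def first_end(seq, prefix):
    """Index just past the first instance of prefix in seq (None if absent)."""
    pos = 0
    for item in prefix:
        pos = seq.find(item, pos)
        if pos < 0:
            return None
        pos += 1
    return pos


def has_common_item(seqs, prefix, use_first_instance):
    """True if some item appears in every i-th period for at least one i.

    use_first_instance=True uses semi-maximum periods (Back-Scan pruning);
    False uses maximum periods (backward-extension checking).
    """
    for i in range(len(prefix)):
        common = None
        for s in seqs:
            bound = first_end(s, prefix) if use_first_instance else len(s)
            for item in reversed(prefix[i:]):  # walk back to e_i
                bound = s.rfind(item, 0, bound)
            period = set(s[first_end(s, prefix[:i]):bound])
            common = period if common is None else common & period
            if not common:
                break  # nothing in common: the remaining sequences are skipped
        if common:
            return True
    return False


def bide(sdb, min_sup):
    """Return {closed sequence: support} for a list of string sequences."""
    closed = {}

    def grow(prefix, seqs):
        counts = Counter()
        for s in seqs:
            counts.update(set(s[first_end(s, prefix):]))  # local items
        forward = any(c == len(seqs) for c in counts.values())
        if not forward and not has_common_item(seqs, prefix, False):
            closed[prefix] = len(seqs)
        for item in sorted(counts):
            if counts[item] >= min_sup:  # locally frequent item
                extend(prefix + item, seqs)

    def extend(prefix, seqs):
        seqs = [s for s in seqs if first_end(s, prefix) is not None]
        if not has_common_item(seqs, prefix, True):  # Back-Scan pruning
            grow(prefix, seqs)

    counts = Counter(item for s in sdb for item in set(s))
    for item in sorted(counts):
        if counts[item] >= min_sup:
            extend(item, sdb)
    return closed


sdb = ["CAABC", "ABCB", "CABC", "ABBCA"]
result = bide(sdb, min_sup=2)
print(result)
```

On the example database with an absolute min_sup of 2, the sketch reports six closed sequences and never stores previously mined patterns for closure checking.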
Performance Evaluation
BIDE can significantly outperform PrefixSpan and SPADE.
BIDE consumes much less memory and can be faster than CloSpan.
The BackScan pruning method is effective in enhancing the performance.
Experiments on three datasets.
SPADE vs. PrefixSpan vs. CloSpan vs. BIDE
CloSpan vs. BIDE (1/3)
Gazelle dataset
CloSpan vs. BIDE (2/3)
Snake dataset
CloSpan vs. BIDE (3/3)
Pi dataset
Conclusions
Closed sequence mining:
More compact result set
Significantly better efficiency
BIDE:
Avoids the curse of candidate maintenance
Prunes the search space more deeply
Consumes much less memory than CloSpan in closure checking
Future work:
Push constraints into the mining process
Thank you!