Efficient Closed Sequence Mining

The document summarizes the BIDE algorithm for efficiently mining frequent closed sequences from sequence databases. BIDE uses bidirectional extension and back-scan pruning to avoid candidate maintenance and deeply prune the search space. It performs closure checking without maintaining candidates in memory. Experiments show BIDE outperforms previous algorithms like CloSpan, PrefixSpan, and SPADE by being over an order of magnitude faster and using orders of magnitude less memory.

BIDE : Efficient Mining of Frequent Closed Sequences

Jianyong Wang and Jiawei Han. Proc. 2004 Int. Conf. on Data Engineering (ICDE'04), Boston, MA, March 2004.

Opening
[Figure: two item sets, each with sup = 5, shown alongside the cases sup < 5 and sup > 5]

Review of Algorithms
The sequential pattern mining problem was first introduced by Agrawal and Srikant (1995). Most algorithms mine all frequent sequences, closed and non-closed alike.
Frequent sequences:
  Apriori-based (Apriori, AprioriAll, GSP)
  Sequence-enumeration tree/lattice-based (Max-Miner, SPAM, SPADE)
  Constraint-based (SPIRIT family)

Frequent closed sequences:
  CloSpan: the first algorithm for mining frequent closed sequences (based on PrefixSpan)
  BIDE

Apriori-based algorithms
Make multiple passes over the database.
First pass:
  Find the frequent 1-sequences.
  Join them to build the candidate 2-sequences.

For step k:
  Join the set of frequent (k-1)-sequences to build the candidate k-sequences.
  Scan the SDB to check which of the candidate k-sequences are actually frequent.

Stop at step m:
  No frequent m-sequence is found.
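
The level-wise loop above can be sketched as follows. This is a simplified illustration, not the code of any of the named algorithms: it assumes every event is a single item (so a sequence is a plain string), uses a GSP-style join on overlapping (k-2)-item parts, and borrows the four-sequence example database that appears later in this deck. The names `contains` and `apriori_sequences` are hypothetical.

```python
def contains(seq, pattern):
    """True if pattern is a subsequence of seq (order kept, gaps allowed)."""
    it = iter(seq)
    return all(item in it for item in pattern)


def apriori_sequences(db, min_sup):
    """Level-wise mining of frequent sequences of single items."""
    def support(p):
        return sum(contains(s, p) for s in db)

    items = sorted({x for s in db for x in s})
    frequent = {}
    # First pass: frequent 1-sequences.
    level = [x for x in items if support(x) >= min_sup]
    while level:  # stop at the first level with no frequent sequence
        frequent.update((p, support(p)) for p in level)
        # Join step: p without its first item equals q without its last item.
        candidates = {p + q[-1] for p in level for q in level if p[1:] == q[:-1]}
        # Scan step: keep only the candidates that are actually frequent.
        level = sorted(c for c in candidates if support(c) >= min_sup)
    return frequent


db = ["CAABC", "ABCB", "CABC", "ABBCA"]
result = apriori_sequences(db, min_sup=2)  # min_sup = 50% of 4 sequences
```

Every level costs one candidate-generation step plus one full scan of the database per candidate, which is the overhead that pattern-growth methods try to avoid.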

Tree-based algorithms (1/2)

Use a sequence enumeration tree to generate all the candidate sequences.
Traverse the tree depth-first (DFS).
Use pruning techniques to avoid traversing sub-trees.
Apriori principle: if a sequence is not frequent, its super-sequences cannot be frequent.

If an extension event leads to a non-frequent sequence, that event is no longer used in other sequence extensions.

Tree-based algorithms (2/2)

I = {a, b}
S-step: add item to sequence
I-step: add item to event
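
A minimal sketch of the two extension steps, assuming a sequence is represented as a tuple of events and each event as a sorted tuple of items. The function names `s_step` and `i_step` and the rule that an I-step may only add an item that sorts after the items already in the event are illustrative assumptions (a common way to avoid generating the same event twice), not something stated on the slide.

```python
def s_step(sequence, item):
    """Sequence extension: append the item as a new event at the end."""
    return sequence + ((item,),)


def i_step(sequence, item):
    """Itemset extension: add the item to the last event of the sequence."""
    *head, last = sequence
    if item <= last[-1]:
        # Assumed convention: items inside an event stay in sorted order.
        raise ValueError("item must sort after the items already in the event")
    return tuple(head) + (last + (item,),)


root = (("a",),)                      # the 1-sequence <(a)>
child_s = s_step(root, "b")           # <(a)(b)>
child_i = i_step(root, "b")           # <(a b)>
grandchild = s_step(child_i, "a")     # <(a b)(a)>
```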

Definition of Terms (1/2)

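Item set: I = {i1, i2, …, in}
Ordered event set (sequence): S = {e1, e2, …, em}
Length: S has length m and is called an m-sequence
Subsequence: Sa = {a1, a2, …, an}
Super-sequence: Sb = {b1, b2, …, bm}
Sa is a subsequence of Sb (and Sb a super-sequence of Sa) if there exist integers 1 ≤ i1 < i2 < … < in ≤ m such that a1 = bi1, a2 = bi2, …, an = bin.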
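
The subsequence condition can be checked by searching for the index list i1 < i2 < … < in directly. The sketch below assumes single-item events, so both sequences are plain strings; `embedding` is a hypothetical helper name. It returns the leftmost valid 1-based indices, or `None` when Sa is not a subsequence of Sb.

```python
def embedding(sa, sb):
    """Return 1-based indices i1 < i2 < ... with sa[k] == sb[ik], or None."""
    indices, pos = [], 0
    for a in sa:
        pos = sb.find(a, pos)   # leftmost match at or after pos (0-based)
        if pos < 0:
            return None         # no match left: sa is not a subsequence
        pos += 1                # 1-based index of the match, next search start
        indices.append(pos)
    return indices


hit = embedding("AB", "CAABC")   # A matches position 2, B matches position 4
miss = embedding("BA", "CAABC")  # no A after the only B
```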

Definition of Terms (2/2)

SDB: sequence database, a set of tuples (sid, S).
A tuple (sid, S) contains a sequence Sa if Sa is a subsequence of S.
supSDB(Sa): absolute support, the number of tuples in SDB that contain Sa.
supSDB(Sa) / |SDB|: relative support.
Sa is frequent if supSDB(Sa) ≥ min_sup.

S is closed if there exists no proper super-sequence of S with the same support.
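
As an illustration of the support definitions, the following sketch counts containing sequences in the four-sequence example database used later in this deck. It assumes single-item events (sequences as strings) and a relative threshold; the helper names are hypothetical.

```python
def contains(seq, pattern):
    """True if pattern is a subsequence of seq."""
    it = iter(seq)
    return all(item in it for item in pattern)


def support(db, pattern):
    """Absolute support: number of sequences in db that contain pattern."""
    return sum(contains(s, pattern) for s in db)


def is_frequent(db, pattern, min_sup_ratio):
    """Compare the relative support against a relative threshold."""
    return support(db, pattern) / len(db) >= min_sup_ratio


db = ["CAABC", "ABCB", "CABC", "ABBCA"]
abs_ca = support(db, "CA")              # sequences 1, 3 and 4 contain CA
rel_ca = abs_ca / len(db)
```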

Frequent Closed Sequences

Find the complete set of frequent closed sequences:
SDB
I = {A, B, C}, min_sup = 50%

Previous algorithms: 17 frequent sequences

BIDE (and CloSpan): 6 frequent closed sequences
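
The two counts can be reproduced with a brute-force sketch over the deck's example database (CAABC, ABCB, CABC, ABBCA; min_sup = 50%, i.e. absolute support 2). This is only a check of the definitions, assuming single-item events; it is not how BIDE or CloSpan work, and the helper names are hypothetical.

```python
def contains(seq, pattern):
    """True if pattern is a subsequence of seq."""
    it = iter(seq)
    return all(item in it for item in pattern)


def frequent_sequences(db, min_sup):
    """Grow patterns one item at a time, keeping the frequent ones."""
    items = sorted({x for s in db for x in s})
    frequent, stack = {}, [""]
    while stack:
        prefix = stack.pop()
        for x in items:
            cand = prefix + x
            sup = sum(contains(s, cand) for s in db)
            if sup >= min_sup:
                frequent[cand] = sup
                stack.append(cand)
    return frequent


def closed_only(frequent):
    """Drop every pattern that has a proper super-sequence of equal support."""
    return {p: sup for p, sup in frequent.items()
            if not any(q != p and sq == sup and contains(q, p)
                       for q, sq in frequent.items())}


db = ["CAABC", "ABCB", "CABC", "ABBCA"]
frequent = frequent_sequences(db, min_sup=2)
closed = closed_only(frequent)
```

Comparing each frequent pattern against all others is quadratic in the size of the result set, which is exactly the kind of candidate maintenance that BIDE is designed to avoid.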

CloSpan
Maintains the set of already mined frequent closed patterns in memory.
Sub-pattern checking: a new pattern can be absorbed by an already mined frequent pattern.
Super-pattern checking: a new pattern can absorb some already mined frequent patterns.

To save space, CloSpan stores a candidate set (a superset of the frequent closed patterns) in a hash-tree indexed structure and then prunes the tree to get the actual set of frequent closed patterns.

BIDE (1/2)
BIDE: an efficient algorithm for discovering the complete set of frequent closed sequences.

BI-Directional Extension: a new paradigm for mining frequent closed sequences without candidate maintenance.

Back-Scan pruning method.

Scan-Skip optimization technique.

Performance study: BIDE can be over an order of magnitude faster than the previous algorithms and consumes orders of magnitude less memory.

BIDE (2/2)
How to enumerate the complete set of frequent sequences?
Upon getting a frequent sequence, how to check if it is closed?
How to design search space pruning methods or other optimization techniques to accelerate the mining process?

First instance
S = CAABC, Sp = AB
First instance of prefix sequence Sp in S: CAAB
Projected sequence of S w.r.t. prefix sequence Sp: C
Projected database of prefix sequence Sp: Sp_SDB, the set of projected sequences of all sequences in SDB that contain Sp.
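
A small sketch of these three notions, assuming single-item events so that sequences are strings. The first instance is found by a greedy left-to-right match; the helper names are hypothetical, and the database is the deck's four-sequence example.

```python
def first_instance_end(s, prefix):
    """Offset just past the first instance of prefix in s, or None."""
    pos = 0
    for e in prefix:
        pos = s.find(e, pos)   # first appearance of e after the previous match
        if pos < 0:
            return None
        pos += 1
    return pos


def project(s, prefix):
    """Return (first instance, projected sequence), or None if no match."""
    end = first_instance_end(s, prefix)
    return None if end is None else (s[:end], s[end:])


def projected_database(db, prefix):
    """Projected sequences of every sequence in db that contains prefix."""
    return [project(s, prefix)[1] for s in db
            if first_instance_end(s, prefix) is not None]


db = ["CAABC", "ABCB", "CABC", "ABBCA"]
instance, projected = project("CAABC", "AB")
ab_sdb = projected_database(db, "AB")
```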

i-th last-in-first appearance

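The i-th last-in-first appearance w.r.t. Sp = e1e2…en is the last appearance of ei inside the first instance of Sp that still precedes the (i+1)-th last-in-first appearance (for i = n, simply the last appearance of en in the first instance).

S = CAABC, Sp = CA
First instance of prefix sequence Sp in S: CA

2nd last-in-first appearance = the last A of the first instance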

i-th semi-maximum period : SMP i( Sp )

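SMPi(Sp): the piece of S between the end of the first instance of prefix e1e2…ei-1 and the i-th last-in-first appearance w.r.t. Sp.

S = ABCB, Sp = AC
For i = 2: the first instance of prefix A ends at A, and the 2nd last-in-first appearance is C, so SMP2(AC) = B.
SMP1(AC) = ∅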

Back-Scan search space pruning method

Theorem: Let the prefix sequence be an n-sequence, Sp = e1e2…en. If there exist an index i (1 ≤ i ≤ n) and an item e such that e appears in each of the i-th semi-maximum periods of the prefix Sp in SDB, we can safely stop growing prefix Sp.

I = {A, B, C}, min_sup = 50%

Sp = B : 4
SMP1 = {CAA, A, CA, A}
Item A appears in every SMP1, so the sub-tree under B is pruned.
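
The pruning test can be sketched as below, assuming single-item events (sequences as strings) and the deck's four-sequence example database. The semi-maximum periods are computed from the definitions given on the preceding slides; the function names are hypothetical and the code is an illustration, not the authors' implementation.

```python
def first_ends(s, p):
    """End offsets (exclusive) of the first instances of p[:1], p[:2], ...; None if absent."""
    ends, pos = [], 0
    for e in p:
        pos = s.find(e, pos)
        if pos < 0:
            return None
        pos += 1
        ends.append(pos)
    return ends


def semi_maximum_periods(s, p):
    """[SMP_1, ..., SMP_n] of prefix p in s (s must contain p)."""
    ends = first_ends(s, p)
    starts = [0] + ends[:-1]        # end of the first instance of e1..e(i-1)
    lf, bound = [], ends[-1]        # search only inside the first instance
    for e in reversed(p):           # last-in-first appearances, right to left
        bound = s.rfind(e, 0, bound)
        lf.append(bound)
    lf.reverse()
    return [s[a:b] for a, b in zip(starts, lf)]


def back_scan_prunable(db, p):
    """True if some item occurs in the i-th SMP of every containing sequence."""
    smps = [semi_maximum_periods(s, p) for s in db if first_ends(s, p)]
    if not smps:
        return False
    return any(set.intersection(*(set(x[i]) for x in smps))
               for i in range(len(p)))


db = ["CAABC", "ABCB", "CABC", "ABBCA"]
smp1_of_b = [semi_maximum_periods(s, "B")[0] for s in db]
```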

BI-Directional Extension closure checking

Forward extension: Sp = e1e2…en, Sp* = e1e2…en e.
If supSDB(Sp) = supSDB(Sp*), Sp is non-closed and item e is a forward-extension event.

Backward extension: Sp = e1e2…en, Sp* = e1e2…ei e ei+1…en (1 ≤ i < n) or Sp* = e e1e2…en.
If supSDB(Sp) = supSDB(Sp*), Sp is non-closed and item e is a backward-extension event.

Theorem: If there exists neither a forward-extension event nor a backward-extension event w.r.t. Sp, Sp must be closed; otherwise it must be non-closed.
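
The theorem can be checked by brute force: insert every item at every position of Sp and compare supports. The sketch below does exactly that on the deck's example database, assuming single-item events. It recounts support for every trial, which BIDE itself avoids; the names are hypothetical.

```python
def contains(seq, pattern):
    """True if pattern is a subsequence of seq."""
    it = iter(seq)
    return all(item in it for item in pattern)


def support(db, pattern):
    return sum(contains(s, pattern) for s in db)


def extension_events(db, p, items):
    """Return (forward, backward) sets of items that keep the support of p."""
    sup = support(db, p)
    # Forward: the item is appended after en.
    forward = {e for e in items if support(db, p + e) == sup}
    # Backward: the item is inserted before e1 or between ei and ei+1.
    backward = {e for i in range(len(p)) for e in items
                if support(db, p[:i] + e + p[i:]) == sup}
    return forward, backward


def is_closed(db, p, items):
    forward, backward = extension_events(db, p, items)
    return not forward and not backward


db = ["CAABC", "ABCB", "CABC", "ABBCA"]
events_ac = extension_events(db, "AC", "ABC")
events_ab = extension_events(db, "AB", "ABC")
```

Testing single-item insertions is enough: if any longer super-sequence has the same support as Sp, every sequence between the two has that support as well, including one that differs from Sp by a single item.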

Last instance
S = CABDCABBA, Sp = AB
Last instance of prefix sequence Sp in S: CABDCABB

i-th last-in-last appearance

S = CAABC, Sp = AB
Last instance of prefix sequence Sp in S: CAAB

1st last-in-last appearance = the last A of the last instance

i-th maximum period : MP i( Sp )

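MPi(Sp): the piece of S between the end of the first instance of prefix e1e2…ei-1 and the i-th last-in-last appearance w.r.t. Sp.

S = ABCB, Sp = AB
For i = 2: the first instance of prefix A ends at A, and the 2nd last-in-last appearance is the last B, so MP2(AB) = BC.
MP1(AB) = ∅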

Backward-Extension event checking

Lemma: If there exists an item e which appears in each of the i-th maximum periods of Sp in SDB, then e is a backward-extension event w.r.t. Sp.
SDB: I = {A, B, C}, min_sup = 50%

Sp = AC : 4
MP2 = {AB, B, B, BB}
The backward-extension event for Sp = AC is B, so AC:4 is a frequent non-closed sequence.

Sp = ABC : 4
No backward-extension item for Sp.
No forward-extension item for Sp.
ABC:4 is a frequent closed sequence.
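
A sketch of the backward-extension check via maximum periods, assuming single-item events and the deck's example database. The maximum periods follow the definitions on the preceding slides: the start of each period comes from the first instance of e1…ei-1, the end from the i-th last-in-last appearance. Function names are hypothetical.

```python
def first_ends(s, p):
    """End offsets (exclusive) of the first instances of p[:1], p[:2], ...; None if absent."""
    ends, pos = [], 0
    for e in p:
        pos = s.find(e, pos)
        if pos < 0:
            return None
        pos += 1
        ends.append(pos)
    return ends


def maximum_periods(s, p):
    """[MP_1, ..., MP_n] of prefix p in s (s must contain p)."""
    starts = [0] + first_ends(s, p)[:-1]
    ll, bound = [], len(s)          # search the whole sequence (last instance)
    for e in reversed(p):           # last-in-last appearances, right to left
        bound = s.rfind(e, 0, bound)
        ll.append(bound)
    ll.reverse()
    return [s[a:b] for a, b in zip(starts, ll)]


def backward_extension_items(db, p):
    """Items that occur in the i-th maximum period of every containing sequence."""
    mps = [maximum_periods(s, p) for s in db if first_ends(s, p)]
    found = set()
    if mps:
        for i in range(len(p)):
            found |= set.intersection(*(set(x[i]) for x in mps))
    return found


db = ["CAABC", "ABCB", "CABC", "ABBCA"]
mp2_of_ac = [maximum_periods(s, "AC")[1] for s in db]
```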

Scan-Skip optimization
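SDB: I = {A, B, C}, min_sup = 50%
Sp = ABC with support 4.

Once one of the i-th maximum periods is empty, no item can appear in all of them, so the remaining ones need not be scanned.

MP1 = {CA, ∅, C, ∅}: skip last two

MP2 = {A, ∅, ∅, B}: skip last two

MP3 = {∅, ∅, ∅, B}: skip last three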
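
The early exit can be sketched as an intersection loop that stops as soon as the running set of common items becomes empty. The periods are passed in as strings (the empty string stands for ∅); the function name is hypothetical and the sketch only illustrates the idea of skipping the remaining scans.

```python
def common_items_scan_skip(periods):
    """Intersect the item sets of the periods, stopping once nothing is common.

    Returns (common items, number of periods that were never scanned).
    """
    common, scanned = None, 0
    for piece in periods:
        scanned += 1
        items = set(piece)
        common = items if common is None else common & items
        if not common:
            break                   # no item can be common any more
    return (common or set()), len(periods) - scanned


mp1 = ["CA", "", "C", ""]
mp2 = ["A", "", "", "B"]
mp3 = ["", "", "", "B"]
skipped = [common_items_scan_skip(mp)[1] for mp in (mp1, mp2, mp3)]
```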

Pruning and optimization

Projected Sequence & Projected Database

SDB

Sequence identifier   Sequence
1                     CAABC
2                     ABCB
3                     CABC
4                     ABBCA

[Sp = AB]
Sp_SDB = {C, CB, C, BCA}

Sequence identifier   Projected sequence
1                     C
2                     CB
3                     C
4                     BCA

Frequent Sequence Enumeration (Prefix-Span)

Forward-extension event checking

Lemma: For a prefix sequence Sp, its complete set of forward-extension events is equivalent to the set of its locally frequent items whose supports are equal to supSDB(Sp).

SDB: I = {A, B, C}, min_sup = 50%

Sp1 = AB : 4
Sp1_SDB = {C, CB, C, BCA}
The forward-extension event for AB is C.

Sp2 = CAB : 2
Sp2_SDB = {C, ∅, C, ∅} (sequences 2 and 4 do not contain CAB)
The forward-extension event for CAB is C.
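
The lemma translates into counting, for each item, how many projected sequences contain it and keeping the items whose count equals the support of Sp. The sketch assumes single-item events and the deck's example database; helper names are hypothetical.

```python
from collections import Counter


def projected_database(db, p):
    """Suffixes that follow the first instance of p, one per containing sequence."""
    projected = []
    for s in db:
        pos = 0
        for e in p:
            pos = s.find(e, pos)
            if pos < 0:
                break
            pos += 1
        else:                        # every item of p was matched
            projected.append(s[pos:])
    return projected


def forward_extension_items(db, p):
    """Local items whose support equals the support of p."""
    projected = projected_database(db, p)
    # Count each item at most once per projected sequence.
    local = Counter(x for suffix in projected for x in set(suffix))
    return {x for x, count in local.items() if count == len(projected)}


db = ["CAABC", "ABCB", "CABC", "ABBCA"]
fwd_ab = forward_extension_items(db, "AB")
fwd_cab = forward_extension_items(db, "CAB")
```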

Scheme of Bide Algorithm

Input: SDB, min_sup

Sp (start from the frequent 1-sequences)

BackScan pruning: if Sp can be pruned, stop growing Sp.

BI-Directional Extension closure checking:
  Backward-extension event checking
  Forward-extension event checking
  Extension events found: Sp is a frequent non-closed sequence.
  No extension event: Sp is a frequent closed sequence.

Grow Sp with an extension event and repeat.

The BIDE algorithm (1/2)

Scan the database once to find all the frequent 1-sequences.
For each frequent 1-sequence, build a pseudo-projected database and check whether it can be pruned (Back-Scan).
For every frequent 1-sequence that cannot be pruned, compute the number of backward-extension items and then call subroutine bide.

The BIDE algorithm (2/2)

Subroutine bide:
For prefix Sp, scan its projected database.
Compute the number of forward-extension items.
If there is neither a forward-extension item nor a backward-extension item, Sp is closed.
Grow Sp with each locally frequent item to get a new prefix.
Build the pseudo-projected database for the new prefix.
For each new prefix:
  Check whether it can be pruned.
  If not, compute the number of backward-extension items and call bide recursively.
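
Putting the steps together, the following is a compact sketch of the whole procedure under simplifying assumptions: events are single items (sequences are strings), projections are recomputed from the original sequences instead of being kept as pseudo-projections, and the Scan-Skip optimization is left out. All function names are hypothetical; this is an illustration of the scheme, not the authors' implementation.

```python
from collections import Counter


def first_ends(s, p):
    """End offsets (exclusive) of the first instances of p[:1], p[:2], ...; None if absent."""
    ends, pos = [], 0
    for e in p:
        pos = s.find(e, pos)
        if pos < 0:
            return None
        pos += 1
        ends.append(pos)
    return ends


def periods(s, p, bound):
    """Semi-maximum periods if bound ends the first instance, maximum periods if bound = len(s)."""
    starts = [0] + first_ends(s, p)[:-1]
    last = []
    for e in reversed(p):
        bound = s.rfind(e, 0, bound)
        last.append(bound)
    last.reverse()
    return [s[a:b] for a, b in zip(starts, last)]


def has_common_item(rows, p, semi):
    """True if some item occurs in the i-th period of every row, for some i."""
    per_row = [periods(s, p, first_ends(s, p)[-1] if semi else len(s)) for s in rows]
    return any(set.intersection(*(set(x[i]) for x in per_row)) for i in range(len(p)))


def bide(db, min_sup):
    closed = {}

    def grow(p):
        rows = [s for s in db if first_ends(s, p)]       # sequences containing p
        if has_common_item(rows, p, semi=True):           # BackScan pruning
            return
        backward = has_common_item(rows, p, semi=False)   # backward-extension check
        # Locally frequent items are counted once per projected sequence.
        local = Counter(x for s in rows for x in set(s[first_ends(s, p)[-1]:]))
        forward = any(c == len(rows) for c in local.values())
        if not backward and not forward:
            closed[p] = len(rows)
        for x in sorted(local):
            if local[x] >= min_sup:
                grow(p + x)

    items = Counter(x for s in db for x in set(s))
    for x in sorted(items):
        if items[x] >= min_sup:
            grow(x)
    return closed


db = ["CAABC", "ABCB", "CABC", "ABBCA"]
closed = bide(db, min_sup=2)
```

On the example database the sketch prunes the sub-trees under B, AC, CAC, CBC and CC by BackScan and never stores a candidate set: each prefix is judged closed or non-closed from the database alone.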

Performance Evaluation
BIDE can significantly outperform PrefixSpan and SPADE.
BIDE consumes much less memory and can be faster than CloSpan.
The BackScan pruning method is effective in enhancing the performance.
Experiments on three datasets.

SPADE vs. PrefixSpan vs. CloSpan vs. BIDE

CloSpan vs. BIDE(1/3)

Gazelle Dataset

CloSpan vs. BIDE(2/3)

Snake Dataset

CloSpan vs. BIDE(3/3)

Pi Dataset

Conclusions
Closed sequence mining:
  More compact result set
  Significantly better efficiency
BIDE:
  Avoids the curse of candidate maintenance
  Prunes search space more deeply
  Consumes much less memory than CloSpan in closure checking
Future work:
  Push constraints into the mining process

Thank you !
