
Multi-Pattern String Matching with Very Large Pattern Sets

Leena Salmela

L. Salmela, J. Tarhio and J. Kytöjoki: Multi-pattern string matching with q-grams. ACM Journal of Experimental Algorithmics, Volume 11, 2006.

November 1st 2007

Leena Salmela Multi-Pattern String Matching November 1st 2007 1 / 22


Outline

Problem Definition and Motivation

Previous Algorithms

Filtering Approach
Verification
Filtering

Experimental Results


Problem Definition and Motivation

Problem Definition

Definition (Multiple pattern matching problem)
Given a pattern set P and a text t, report all occurrences of all the patterns in the text.

◮ The text t is a string of n characters drawn from the alphabet Σ (of size σ).
◮ The pattern set P is a set of r patterns each of which is a string of characters over the alphabet Σ.
◮ For simplicity we assume that all patterns have the same length m.
◮ We are especially interested in searching for large pattern sets (>10,000 patterns)
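
To make the problem statement concrete, here is a minimal brute-force sketch in Python. It is only a reference baseline, not one of the algorithms discussed in these slides; the function name `naive_multi_match` and the sample inputs are made up for illustration. It tries every pattern at every text position, so it takes roughly n · r · m character comparisons in the worst case, which is exactly what the later slides try to avoid.

```python
def naive_multi_match(patterns, text):
    """Report (position, pattern) for every occurrence of every pattern."""
    occurrences = []
    for i in range(len(text)):
        for p in patterns:
            # str.startswith with an offset compares p against text[i:i+len(p)]
            if text.startswith(p, i):
                occurrences.append((i, p))
    return occurrences


# Two patterns of the same length m = 3 (illustrative input).
print(naive_multi_match(["the", "tim"], "the time they"))
# prints [(0, 'the'), (4, 'tim'), (9, 'the')]
```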



Problem Definition and Motivation

Why large pattern sets?

◮ Applications where large pattern sets are needed:
  ◮ Antivirus scanning (around 100,000 known viruses)
  ◮ Intrusion detection
  ◮ Bioinformatics
◮ Older algorithms were not developed for such large pattern sets and they do not scale very well.


Previous Algorithms

Trie-Based Algorithms

◮ Aho-Corasick, Commentz-Walter, SBOM etc.
◮ Many multi-pattern algorithms build a trie of the patterns and search the text with the aid of the trie.
◮ The trie grows quite rapidly as the pattern set grows.
◮ For σ = 256, m = 8 and 100,000 patterns the trie takes 500 MB of memory.
=⇒ Trie-based algorithms are not practical for large pattern sets.

Figure: Trie built of the patterns “the”, “they” and “time”.


Previous Algorithms

Rabin-Karp (for Single Pattern)

Preprocessing
1. Compute a hash of the pattern, $hs(p_0 \ldots p_{m-1})$.

Searching
1. For each text position i compute the hash $hs(t_i \ldots t_{i+m-1})$.
2. If the hash equals the hash of the pattern, verify the match.
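
The two phases above can be sketched in Python as follows. The slides do not say which hash function hs is, so the polynomial rolling hash below, together with the constants `BASE` and `MOD` and the function names, is an assumption made for illustration. The rolling update is what lets each new window hash be computed in constant time from the previous one.

```python
BASE = 256            # assumed radix of the polynomial hash
MOD = 1_000_000_007   # assumed prime modulus


def poly_hash(s):
    """Polynomial hash of s (an assumed stand-in for the slides' hs)."""
    h = 0
    for ch in s:
        h = (h * BASE + ord(ch)) % MOD
    return h


def rabin_karp(pattern, text):
    """Return the start positions of all occurrences of pattern in text."""
    m, n = len(pattern), len(text)
    if m == 0 or m > n:
        return []
    target = poly_hash(pattern)      # preprocessing: hash of the pattern
    h = poly_hash(text[:m])          # hash of the first text window
    top = pow(BASE, m - 1, MOD)      # weight of the window's first character
    hits = []
    for i in range(n - m + 1):
        # Equal hashes only give a candidate, so compare the strings too.
        if h == target and text[i:i + m] == pattern:
            hits.append(i)
        if i + m < n:
            # Roll the window: drop text[i], append text[i + m].
            h = ((h - ord(text[i]) * top) * BASE + ord(text[i + m])) % MOD
    return hits


print(rabin_karp("time", "ttime time"))  # prints [1, 6]
```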



Previous Algorithms

Multiple Pattern Matching Based on Rabin-Karp

Preprocessing
1. Compute the hash $hs(p^i_0 \ldots p^i_{m-1})$ of each pattern $p^i$ and store them.
2. Sort the patterns according to the hash values.

Searching
1. For each text position i compute the hash $hs(t_i \ldots t_{i+m-1})$.
2. Search for the hash value from the saved hash values of the patterns using binary search.
3. If the hash equals the hash of a pattern, verify the match.
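
A minimal Python sketch of these steps is given below. As before, the polynomial hash, the constants `BASE` and `MOD`, and the function names are assumptions for illustration; the slides only require some hash hs. The sorted list of (hash, pattern) pairs is searched with `bisect_left` from the standard library, and every pattern sharing the window's hash is verified.

```python
from bisect import bisect_left

BASE, MOD = 256, 1_000_000_007   # assumed hash parameters


def poly_hash(s):
    """Polynomial hash of s (an assumed stand-in for the slides' hs)."""
    h = 0
    for ch in s:
        h = (h * BASE + ord(ch)) % MOD
    return h


def multi_rabin_karp(patterns, text):
    """Return (position, pattern) pairs; patterns is non-empty, all of length m."""
    m, n = len(patterns[0]), len(text)
    if m == 0 or m > n:
        return []
    # Preprocessing: hash every pattern and sort by hash value.
    table = sorted((poly_hash(p), p) for p in set(patterns))
    keys = [h for h, _ in table]
    top = pow(BASE, m - 1, MOD)
    h = poly_hash(text[:m])
    hits = []
    for i in range(n - m + 1):
        # Searching: binary search for the window hash among the pattern hashes.
        j = bisect_left(keys, h)
        while j < len(keys) and keys[j] == h:
            if text.startswith(table[j][1], i):   # verify the candidate
                hits.append((i, table[j][1]))
            j += 1
        if i + m < n:
            # Roll the window: drop text[i], append text[i + m].
            h = ((h - ord(text[i]) * top) * BASE + ord(text[i + m])) % MOD
    return hits


print(multi_rabin_karp(["the", "tim"], "the time they"))
# prints [(0, 'the'), (4, 'tim'), (9, 'the')]
```

With r patterns the binary search adds a factor of about log r per text position, which is far cheaper than comparing the window against all r patterns.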


Filtering Approach

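Filtering Approach

Figure: Each text position ("A match?") goes to the filter, which answers "No" or "Maybe"; a "Maybe" goes on to the verifier, which answers "Yes" or "No".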

◮ Given a text position, a filter can tell if there cannot be a match at this position.
◮ The hashes in the (single pattern) Rabin-Karp algorithm act as a filter: if the hashes do not match, there cannot be a match at that position.
◮ A good filter is fast and produces few false positives.
◮ A verifier is needed to distinguish between false and true positives.


Filtering Approach Verification

Verification

◮ Verification of a single pattern is easy (pairwise comparison).
◮ In a multiple pattern algorithm, the filter only tells that some of the patterns might match
=⇒ The verifier also needs to figure out which pattern to try.
◮ Using a trie would work but needs a lot of space (something we wanted to avoid in the first place).
◮ The verifier should be space-efficient and faster than pairwise comparison of all patterns against the given text position
=⇒ Rabin-Karp for multiple patterns!

Filtering Approach Filtering

Character Class Filter

◮ Given a set of patterns...
◮ ...construct a generalized pattern with character classes and apply any algorithm capable of handling such generalized patterns.
◮ How to make it work with very large pattern sets?

p a t t e r n
f i l t e r s

[f,p] [a,i] [l,t] [t] [e] [r] [n,s]


Filtering Approach Filtering

Character Class Filter with q-Grams

◮ Given a set of patterns...
◮ ...construct a generalized pattern with character classes and apply any algorithm capable of handling such generalized patterns.
◮ How to make it work with very large pattern sets?
◮ Use superalphabets (q-grams)
◮ ...and construct a generalized pattern.

p a t t e r n → pa at tt te er rn
f i l t e r s → fi il lt te er rs

[fi,pa] [at,il] [lt,tt] [te] [er] [rn,rs]
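
The construction on this slide can be sketched in Python as below. The function names `qgrams`, `generalized_pattern` and `filter_accepts` are invented for this illustration, and the test string “pilters” is a made-up example that does not appear in the slides. With q = 1 the code reproduces the plain character class pattern, with q = 2 the q-gram version.

```python
def qgrams(s, q):
    """All overlapping q-grams of s, in order."""
    return [s[i:i + q] for i in range(len(s) - q + 1)]


def generalized_pattern(patterns, q):
    """Class j is the set of q-grams starting at position j of some pattern."""
    classes = [set() for _ in range(len(patterns[0]) - q + 1)]
    for p in patterns:
        for j, g in enumerate(qgrams(p, q)):
            classes[j].add(g)
    return classes


def filter_accepts(classes, q, window):
    """True if every q-gram of window belongs to the class at its position."""
    grams = qgrams(window, q)
    return len(grams) == len(classes) and all(
        g in c for g, c in zip(grams, classes)
    )


patterns = ["pattern", "filters"]
chars = generalized_pattern(patterns, 1)   # [f,p] [a,i] [l,t] [t] [e] [r] [n,s]
pairs = generalized_pattern(patterns, 2)   # [fi,pa] [at,il] [lt,tt] [te] [er] [rn,rs]

print(filter_accepts(chars, 1, "pilters"))  # prints True  (false positive)
print(filter_accepts(pairs, 2, "pilters"))  # prints False (rejected with q = 2)
print(filter_accepts(pairs, 2, "filtern"))  # prints True  (still a false positive)
```

In this toy example the 2-gram classes reject a string that the single-character classes let through, because neighbouring characters now have to come from the same pattern. Some false positives remain, so a verifier is still needed.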



Filtering Approach Filtering

Character Class Filters

◮ The character class filter is truly a filter
  ◮ It recognizes every occurrence of every pattern.
  ◮ False positives are also found. (E.g. “filtern” and “patters” are recognized by the filter on the previous slide.)
◮ We have implemented the filter with three different character class algorithms:
  ◮ Multi-Pattern Horspool with q-Grams (HG)
    ◮ A Boyer-Moore-Horspool type algorithm
  ◮ Multi-Pattern Shift-Or with q-Grams (SOG)
    ◮ Shift-Or (simplest, presented in following slides)
  ◮ Multi-Pattern BNDM with q-Grams (BG)
    ◮ BNDM (average optimal for q = O(log_σ r), fastest in practice)


Filtering Approach Filtering

Shift-Or Character Class Filter

◮ Suppose we are searching for patterns “lift” and “time” so the character class pattern is “[l,t][i][f,m][e,t]”.
◮ The following NFA finds all occurrences of the character class pattern:

Figure: NFA with states 0 to 4. An ε-arrow leads into state 0, and the transitions are 0 → 1 on l|t, 1 → 2 on i, 2 → 3 on f|m and 3 → 4 on e|t.

◮ The shift-or algorithm is a bit-parallel simulation of this automaton.



Filtering Approach Filtering

Shift-Or Character Class Filter: Preprocessing

◮ For each character c of the alphabet, initialize a bit vector T[c] such that the i’th bit is 0 iff the character appears in any of the patterns in position i. (Bits and positions are counted from 1, and the first bit is the rightmost one.)
◮ In our example (patterns “lift” and “time”):
    T[’e’] 0111
    T[’f’] 1011
    T[’i’] 1101
    T[’l’] 1110
    T[’m’] 1011
    T[’t’] 0110
◮ The automaton has a transition from state i − 1 to state i on character c iff the i’th bit in T[c] is 0.
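
The preprocessing can be sketched in a few lines of Python. This is an illustrative sketch, not the authors' implementation: the function name `build_table` is invented, a dictionary stands in for an array indexed by character, and characters that occur in no pattern are simply absent (they are treated as all ones). The code numbers bits and pattern positions from 0, with bit 0 as the least significant bit.

```python
def build_table(patterns):
    """Return (T, mask): T[c] has bit i cleared iff c occurs at position i
    (0-based) of some pattern; mask is the all-ones vector of m bits."""
    m = len(patterns[0])
    mask = (1 << m) - 1
    table = {}
    for p in patterns:
        for i, c in enumerate(p):
            # Start from all ones and clear the bit for this position.
            table[c] = table.get(c, mask) & ~(1 << i)
    return table, mask


table, mask = build_table(["lift", "time"])
for c in sorted(table):
    print(c, format(table[c], "04b"))
# prints:
# e 0111
# f 1011
# i 1101
# l 1110
# m 1011
# t 0110
```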



Filtering Approach Filtering

Shift-Or Character Class Filter: Searching

◮ State vector E where the i’th bit is 0 iff state i in the automaton is active.
◮ Initialize E as 1111.
◮ Update E when a character c is read from the text:

E = (E ≪ 1) | T[c]

◮ After the update, the i’th bit in E is 0 iff the (i − 1)’th bit was 0 (the previous state i − 1 was active) and the i’th bit is 0 in T[c] (there is a transition from state i − 1 to i on c). The shift brings in a 0 as the first bit, which corresponds to state 0 being always active.
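
The search loop can be sketched in Python as follows. The names `build_table` and `shift_or_filter` are invented for this illustration, and the text “life” is a made-up input. Python integers are unbounded, so the sketch masks E back to m bits after every shift; a fixed-width machine word would drop the overflowing bit on its own. Bits are numbered from 0 at the least significant end, so a cleared bit m − 1 means that the last state is active.

```python
def build_table(patterns):
    """T[c] has bit i cleared iff c occurs at 0-based position i of a pattern."""
    m = len(patterns[0])
    mask = (1 << m) - 1
    table = {}
    for p in patterns:
        for i, c in enumerate(p):
            table[c] = table.get(c, mask) & ~(1 << i)
    return table, mask


def shift_or_filter(patterns, text):
    """Return (candidate start positions, state vector after each character)."""
    m = len(patterns[0])
    table, mask = build_table(patterns)
    accept = 1 << (m - 1)        # bit that signals a full-length candidate
    E = mask                     # all ones: only the start state is active
    candidates, trace = [], []
    for pos, c in enumerate(text):
        # Characters that occur in no pattern have T[c] = all ones.
        E = ((E << 1) | table.get(c, mask)) & mask
        trace.append(format(E, f"0{m}b"))
        if E & accept == 0:
            candidates.append(pos - m + 1)
    return candidates, trace


print(shift_or_filter(["lift", "time"], "ttime"))
# prints ([1], ['1110', '1110', '1101', '1011', '0111'])
print(shift_or_filter(["lift", "time"], "life")[0])
# prints [0]  ("life" is a false positive that a verifier must reject)
```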

Matching against the text: “ttime”

Step        State vector                      Active states
Initially   E = 1111                          0
Read ’t’    E = (1111 ≪ 1) | 0110 = 1110      0, 1
Read ’t’    E = (1110 ≪ 1) | 0110 = 1110      0, 1
Read ’i’    E = (1110 ≪ 1) | 1101 = 1101      0, 2
Read ’m’    E = (1101 ≪ 1) | 1011 = 1011      0, 3
Read ’e’    E = (1011 ≪ 1) | 0111 = 0111      0, 4

Figure: The NFA after each step with the active states marked. State 0 is active throughout, and state 4 becomes active after reading ’e’.



Experimental Results

Experimental Results

Figure: Runtime (s) against the number of patterns (100 to 100,000). Legend: agrep, SBOM tables, RKBT, HG, SOG, BG.

m = 8, σ = 256, random data, q = 2...3


Experimental Results

Experimental Results

Figure: Runtime (s) against the number of patterns (100 to 100,000). Legend: SBOM tables, RKBT, HG, SOG, BG.

m = 32, σ = 4, DNA data (chromosome from fruit fly genome), q = 6...10


Experimental Results

Summary

◮ Trie-based approaches not practical with very large pattern sets
◮ Filtering approach to multiple pattern matching
  ◮ Transform patterns to sequences of q-grams
  ◮ Filter with a character class pattern built from the transformed pattern set
  ◮ Verify with a Rabin-Karp style algorithm
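
The three steps of the filtering approach can be put together in one short Python sketch. It is a simplified illustration and not the SOG implementation from the paper: the function name `sog_search` and the sample text are invented, the bit vectors are stored in a dictionary keyed by the q-gram itself, and verification uses a plain set lookup as a stand-in for the Rabin-Karp style verifier with binary search.

```python
def sog_search(patterns, text, q=2):
    """Shift-or filter over q-grams followed by verification.
    Returns (verified matches, candidate start positions from the filter)."""
    m = len(patterns[0])
    width = m - q + 1                  # number of q-grams per pattern
    mask = (1 << width) - 1
    # Step 1 and 2: transform patterns to q-grams and build the class table.
    table = {}
    for p in patterns:
        for i in range(width):
            g = p[i:i + q]
            table[g] = table.get(g, mask) & ~(1 << i)
    pattern_set = set(patterns)        # stand-in verifier (assumption)
    accept = 1 << (width - 1)
    E = mask
    matches, candidates = [], []
    for pos in range(len(text) - q + 1):
        g = text[pos:pos + q]          # q-gram starting at this position
        E = ((E << 1) | table.get(g, mask)) & mask
        if E & accept == 0:            # the filter says "maybe"
            start = pos - width + 1
            candidates.append(start)
            window = text[start:start + m]
            if window in pattern_set:  # step 3: verify the candidate
                matches.append((start, window))
    return matches, candidates


matches, candidates = sog_search(["pattern", "filters"], "a filtern pattern filters")
print(candidates)  # prints [2, 10, 18]
print(matches)     # prints [(10, 'pattern'), (18, 'filters')]
```

The candidate at position 2 is the false positive “filtern”: it passes the filter and is discarded by the verifier.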
