Data Engineering
Pattern Matching
張賢宗
Pattern Matching 110/12/07
Outline
• What is Pattern Matching
• The Brute Force Algorithm
• The Knuth-Morris-Pratt(KMP) Algorithm
• The Boyer-Moore Algorithm
• External Pattern Matching
What is Pattern Matching?
• Given a (long) text string T and a (short) pattern
P, find all occurrences of the pattern in the text.
▫ T: “It is a good day to take a god damn rest.”
▫ P: “go”
• Applications
▫ Text editor
▫ DNA sequence matching
▫…
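The problem statement above can be sketched in a few lines of C. This is only an illustration that leans on the standard library's strstr; the function name find_all and its parameters are assumptions made for this sketch, not part of the slides.

```c
#include <string.h>

/* Counts every occurrence of pattern in text, overlapping ones included.
 * If positions is non-NULL, the first max_positions start indexes are stored.
 * Assumption for this sketch: an empty pattern counts as "no occurrence". */
int find_all(const char *text, const char *pattern, int *positions, int max_positions)
{
    int count = 0;
    if (pattern[0] == '\0')
        return 0;
    /* strstr returns a pointer to the first occurrence at or after its
     * first argument, or NULL when there is none. */
    for (const char *p = strstr(text, pattern); p != NULL; p = strstr(p + 1, pattern)) {
        if (positions != NULL && count < max_positions)
            positions[count] = (int)(p - text);
        count++;
    }
    return count;
}
```

With the slide's text T and P = "go", the two occurrences start at indexes 8 ("good") and 27 ("god").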
Basic Concepts
• Assume S is a string of length m
• S[i…j] is the fragment between indexes i and j; we
call this fragment a substring of S
• S[0…i] is a prefix of S, where 0<=i<=m-1
• S[i…m-1] is a suffix of S, where 0<=i<=m-1
Examples
• S: smallpig
• Substring
▫ mal
▫ lpig
• Prefix
▫ smallpig, smallpi, smallp, small, smal, sma, sm, s
• Suffix
▫ smallpig, mallpig, allpig, llpig, lpig, pig, ig, g
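The three definitions can be checked mechanically. The helpers below are a minimal sketch; the names is_prefix, is_suffix and is_substring are hypothetical, and empty strings are rejected because the definitions above require 0<=i<=m-1.

```c
#include <string.h>

/* 1 if candidate equals S[0..i] for some i, 0 otherwise. */
int is_prefix(const char *s, const char *candidate)
{
    size_t k = strlen(candidate);
    return k >= 1 && k <= strlen(s) && strncmp(s, candidate, k) == 0;
}

/* 1 if candidate equals S[i..m-1] for some i, 0 otherwise. */
int is_suffix(const char *s, const char *candidate)
{
    size_t m = strlen(s);
    size_t k = strlen(candidate);
    return k >= 1 && k <= m && strcmp(s + (m - k), candidate) == 0;
}

/* 1 if candidate equals S[i..j] for some i <= j, 0 otherwise. */
int is_substring(const char *s, const char *candidate)
{
    return candidate[0] != '\0' && strstr(s, candidate) != NULL;
}
```

For S = "smallpig", "small" is a prefix, "lpig" is a suffix, and "mal" is a substring that is neither.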
The Brute Force Algorithm
• Check each position in the text T to see if the pattern
P starts in that position and matches.
T: s m a l l p i g
P: a l l
P:   a l l
P:     a l l
Brute Force in C Code
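#include <string.h>

/* Returns the index of the first occurrence of pattern in text, or -1. */
int brute(const char *text, const char *pattern)
{
    int n = strlen(text);    // n is the length of the text
    int m = strlen(pattern); // m is the length of the pattern
    int j;
    for (int i = 0; i <= (n - m); i++) {  // try every start position i
        j = 0;
        while ((j < m) && text[i + j] == pattern[j])
            j++;             // extend the match while characters agree
        if (j == m)
            return i;        // match at i
    }
    return -1;               // no match
}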
Time Complexity
• Brute force pattern matching runs in O(mn) time
in the worst case, where n is the length of the text
and m is the length of the pattern.
• But most searches of ordinary text take
O(m+n), which is very quick.
Alphabets
• Alphabet
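▫ The set of characters that may appear in a string
▫ English: a~z, A~Z, 0~9
▫ Computer: byte values 0 ~ 255 (extended ASCII)
▫ Bits: 0, 1
• The brute force algorithm is fast when the alphabet
of the text is large, because most attempts fail at the
first comparison
• It is slower when the alphabet is small, because
long partial matches are more likely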
Examples of Worst and Average Cases
• Example of a worst case:
▫ T: “bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbba”
▫ P: “bbbbbba”
• Example of a more average case:
▫ T: “computer science and information engineering”
▫ P: “engine”
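To see the gap between the two cases, the brute-force search can be instrumented to count character comparisons. The sketch below is an illustration only; the name brute_count and the comparisons out-parameter are assumptions added for this example.

```c
#include <string.h>

/* Brute-force search that also reports how many character comparisons
 * were made. Returns the index of the first match, or -1. */
int brute_count(const char *text, const char *pattern, long *comparisons)
{
    int n = (int)strlen(text);
    int m = (int)strlen(pattern);
    *comparisons = 0;
    for (int i = 0; i <= n - m; i++) {
        int j = 0;
        while (j < m) {
            (*comparisons)++;              /* one comparison of text[i+j] with pattern[j] */
            if (text[i + j] != pattern[j])
                break;
            j++;
        }
        if (j == m)
            return i;
    }
    return -1;
}
```

On a shortened worst-case input, T = "bbbbba" and P = "bba", every one of the 4 attempts uses all 3 comparisons, 12 in total. On the "engine" example most attempts stop after a single comparison, so the total stays close to the length of the text.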
Thinking Over the Problem
• If a mismatch occurs between the text and
pattern P at P[j], what is the most we can
shift the pattern to avoid wasteful comparisons?
Answer
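• Find the largest prefix of P[0..j-1] that is also a suffix of
P[1..j-1]; if its length is k, the pattern can be shifted by j-k
so that this prefix lines up with the text already matched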
Why
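• Here P stands for the part of the pattern matched so far and m for
its length, so T[i..i+m-1] = P[0..m-1]
• Let u be the largest proper prefix of P that is also a suffix
of P, with |u| = k (P[0..k-1] = P[m-k..m-1])
• Assume that we can find a match starting at T[i+d],
where 0 < d < m-k
• Then T[i+d..i+m-1] = P[0..m-d-1], a prefix of P of length m-d
• But T[i+d..i+m-1] = P[d..m-1] is also a suffix of P of length m-d
• So P has a prefix of length m-d > k = |u| that is also a suffix: contradiction.
• We cannot find such a match starting at T[i+d]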
Example
• Find the largest prefix of:
  "a b a a b" ( P[0..j-1] )
  which is a suffix of:
  "b a a b" ( P[1..j-1] )
• It is "a b"
• Set j = 2 // the new j value
Failure Function
• KMP preprocesses the pattern to find matches of
prefixes of the pattern with the pattern itself.
• j = mismatch position in P[]
• k = position before the mismatch (k = j-1).
• The failure function F(k) is defined as the size of
the largest prefix of P[0..k] that is also a suffix of
P[1..k].
Failure Function Example
• P: "abaaba" (j = 0..5 indexes P; k = j-1)
  k      0  1  2  3  4
  P[k]   a  b  a  a  b
  F(k)   0  0  1  1  2
• F(k) is the size of the largest prefix of P[0..k] that is
also a suffix of P[1..k]
• In code, F() is represented by an array, like the
table.
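The table can be filled in O(m) time by matching the pattern against itself. The function below is a sketch of that standard construction; the name build_failure and the caller-supplied fail array are assumptions made for this example.

```c
#include <string.h>

/* Fills fail[0..m-1] so that fail[k] is the size of the largest prefix of
 * pattern[0..k] that is also a suffix of pattern[1..k].
 * The caller must provide room for strlen(pattern) ints. */
void build_failure(const char *pattern, int *fail)
{
    int m = (int)strlen(pattern);
    if (m == 0)
        return;
    fail[0] = 0;
    int j = 0;   /* length of the prefix currently matched */
    int i = 1;   /* position in the pattern being processed */
    while (i < m) {
        if (pattern[j] == pattern[i]) {
            fail[i] = j + 1;      /* the matched prefix grows by one character */
            i++;
            j++;
        } else if (j > 0) {
            j = fail[j - 1];      /* fall back to the next shorter candidate */
        } else {
            fail[i] = 0;          /* no prefix ends at position i */
            i++;
        }
    }
}
```

For "abaaba" this yields 0 0 1 1 2 for k = 0..4, as in the table, plus F(5) = 3, which the table does not need because k = j-1 never exceeds 4.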
F(4)=2
• Find the size of the largest prefix of P[0..4] that is
also a suffix of P[1..4]
▫ Find the size of the largest prefix of "abaab" that
is also a suffix of "baab"
▫ It is "ab", so F(4) = 2
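The same search can be written directly from the definition: try every candidate length from the longest down and compare the prefix with the suffix. This is a deliberately naive sketch for checking small examples by hand; the name failure_naive is hypothetical, and k is assumed to be smaller than the pattern length.

```c
#include <string.h>

/* F(k) computed straight from the definition.
 * Assumes 0 <= k < strlen(pattern). */
int failure_naive(const char *pattern, int k)
{
    for (int len = k; len >= 1; len--) {
        /* Compare the prefix pattern[0..len-1] with the last len characters
         * of pattern[0..k]; since len <= k, that suffix starts at index >= 1,
         * so it is a suffix of pattern[1..k]. */
        if (strncmp(pattern, pattern + (k - len + 1), (size_t)len) == 0)
            return len;
    }
    return 0;
}
```

For P = "abaaba" and k = 4, lengths 4 ("abaa" vs "baab") and 3 ("aba" vs "aab") fail, and length 2 ("ab" vs "ab") succeeds, so the result is 2.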
KMP in C Code
• Knuth-Morris-Pratt’s algorithm modifies the
brute-force algorithm.
▫ if a mismatch occurs at P[j] with j > 0
(i.e. P[j] != T[i]), then
    k = j-1;
    j = F(k); // obtain the new j
▫ if the mismatch occurs at P[0], simply move on to T[i+1]
KMP in C Code
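#include <string.h>

/* The slide shows only the loop; the function wrapper and the precomputed
   fail[] parameter (fail[k] = F(k)) are assumptions added to make it complete. */
int kmp(const char *text, const char *pattern, const int *fail)
{
    int n = strlen(text), m = strlen(pattern);
    int i = 0, j = 0;               // positions in text and pattern
    while (i < n) {
        if (pattern[j] == text[i]) {
            if (j == m - 1)
                return i - m + 1;   // match
            i++;
            j++;
        }
        else if (j > 0)
            j = fail[j-1];          // reuse the matched prefix, i stays put
        else
            i++;                    // mismatch at pattern[0]
    }
    return -1;                      // no match
}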
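Putting the failure function and the matching loop together gives a complete search. The version below is a self-contained sketch; the names kmp_search and kmp_failure are assumptions, and returning 0 for an empty pattern is a convention chosen for this example.

```c
#include <stdlib.h>
#include <string.h>

/* Computes the failure function of pattern[0..m-1] into fail[0..m-1]. */
static void kmp_failure(const char *pattern, int m, int *fail)
{
    fail[0] = 0;
    for (int i = 1, j = 0; i < m; i++) {
        while (j > 0 && pattern[j] != pattern[i])
            j = fail[j - 1];
        if (pattern[j] == pattern[i])
            j++;
        fail[i] = j;
    }
}

/* Returns the index of the first occurrence of pattern in text, or -1. */
int kmp_search(const char *text, const char *pattern)
{
    int n = (int)strlen(text);
    int m = (int)strlen(pattern);
    if (m == 0)
        return 0;                 /* assumption: empty pattern matches at 0 */
    if (m > n)
        return -1;
    int *fail = malloc((size_t)m * sizeof *fail);
    if (fail == NULL)
        return -1;                /* out of memory is reported as "no match" here */
    kmp_failure(pattern, m, fail);

    int i = 0, j = 0, result = -1;
    while (i < n) {
        if (pattern[j] == text[i]) {
            if (j == m - 1) {
                result = i - m + 1;   /* match */
                break;
            }
            i++;
            j++;
        } else if (j > 0) {
            j = fail[j - 1];          /* i never moves backwards */
        } else {
            i++;
        }
    }
    free(fail);
    return result;
}
```

For example, kmp_search("smallpig", "all") returns 2.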
Analysis of KMP
• KMP runs in optimal time: O(m+n)
• The algorithm never needs to move backwards in
the input text, T
▫ This makes the algorithm good for processing very
large files that are read in from external devices or
through a network stream.
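Because the text index never decreases, the matcher can consume the text one character at a time and keep only the pattern and a single integer of state. The sketch below illustrates this; the KmpStream type, the function names and the KMP_MAX_PATTERN limit are all hypothetical choices made for this example.

```c
#include <string.h>

#define KMP_MAX_PATTERN 256   /* hypothetical limit chosen for this sketch */

typedef struct {
    const char *pattern;
    int m;                        /* pattern length */
    int j;                        /* number of pattern characters matched so far */
    int fail[KMP_MAX_PATTERN];    /* failure function of the pattern */
} KmpStream;

/* Prepares the matcher. Returns 0 on success, -1 if the pattern is empty
 * or longer than KMP_MAX_PATTERN. */
int kmp_stream_init(KmpStream *s, const char *pattern)
{
    int m = (int)strlen(pattern);
    if (m == 0 || m > KMP_MAX_PATTERN)
        return -1;
    s->pattern = pattern;
    s->m = m;
    s->j = 0;
    s->fail[0] = 0;
    for (int i = 1, k = 0; i < m; i++) {
        while (k > 0 && pattern[k] != pattern[i])
            k = s->fail[k - 1];
        if (pattern[k] == pattern[i])
            k++;
        s->fail[i] = k;
    }
    return 0;
}

/* Feeds the next text character. Returns 1 if an occurrence of the pattern
 * ends at this character, 0 otherwise. Earlier characters are never revisited. */
int kmp_stream_feed(KmpStream *s, char c)
{
    while (s->j > 0 && s->pattern[s->j] != c)
        s->j = s->fail[s->j - 1];
    if (s->pattern[s->j] == c)
        s->j++;
    if (s->j == s->m) {
        s->j = s->fail[s->m - 1];   /* continue, so overlapping matches are found */
        return 1;
    }
    return 0;
}
```

A caller would typically loop over getc() on a file or over bytes received from a socket and call kmp_stream_feed on each one, without buffering the text.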
Analysis of KMP
• KMP doesn’t work so well as the size of the alphabet
increases
▫ More chance of a mismatch (more possible mismatches)
▫ Mismatches tend to occur early in the pattern, but
KMP is faster when the mismatches occur later
Boyer Moore Algorithm
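• The Boyer-Moore pattern matching algorithm is
based on two techniques.
▫ The looking-glass technique
Find P in T by moving backwards through P,
starting at its end.
▫ The character-jump technique
When a mismatch occurs at T[i] = x, shift P according
to where x last occurs in P (the three cases that follow).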
BM Case 1
BM Case 2
BM Case 3
BM Example
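T: a pattern matching algorithm
P: rithm                           1: 't' vs 'm', mismatch
     rithm                         2: 'e' vs 'm', mismatch
          rithm                    3: 'a' vs 'm', mismatch
               rithm               4: 'n' vs 'm', mismatch
                    rithm          5: 'g' vs 'm', mismatch
                         rithm     6: 'h' vs 'm', mismatch
                          rithm    7-11: 'm', 'h', 't', 'i', 'r' all match
(The numbers give the order of the 11 character comparisons.)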
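The example can be replayed with a simplified Boyer-Moore that uses only the looking-glass and character-jump techniques, driven by a last-occurrence table. This is a sketch under that assumption, not the full algorithm with the good-suffix rule; the name bm_last_occurrence and the comparisons counter are hypothetical.

```c
#include <limits.h>
#include <string.h>

/* Simplified Boyer-Moore: scan the pattern right to left and, on a mismatch,
 * jump according to the last occurrence of the text character in the pattern.
 * Returns the index of the first match or -1. If comparisons is non-NULL it
 * receives the number of character comparisons made. */
int bm_last_occurrence(const char *text, const char *pattern, int *comparisons)
{
    int last[UCHAR_MAX + 1];
    int n = (int)strlen(text);
    int m = (int)strlen(pattern);
    if (comparisons != NULL)
        *comparisons = 0;
    if (m == 0)
        return 0;                 /* assumption: empty pattern matches at 0 */
    if (m > n)
        return -1;

    for (int c = 0; c <= UCHAR_MAX; c++)
        last[c] = -1;             /* -1: character does not occur in the pattern */
    for (int k = 0; k < m; k++)
        last[(unsigned char)pattern[k]] = k;

    int i = m - 1;                /* index into text */
    int j = m - 1;                /* index into pattern */
    while (i < n) {
        if (comparisons != NULL)
            (*comparisons)++;
        if (pattern[j] == text[i]) {
            if (j == 0)
                return i;         /* whole pattern matched */
            i--;                  /* looking-glass: keep moving backwards */
            j--;
        } else {
            int lo = last[(unsigned char)text[i]];
            int step = (j < lo + 1) ? j : lo + 1;
            i += m - step;        /* character jump */
            j = m - 1;
        }
    }
    return -1;
}
```

On T = "a pattern matching algorithm" and P = "rithm" it reports the match at index 23 after 11 comparisons, in line with the numbering in the example.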
BM Bad Character Shift Function
• Boyer-Moore’s algorithm preprocesses the
pattern P and the alphabet A to build the shift
values for every character.
Shift Function Example
• A = {a, b, c, d}
• P = "abacab"
  x        a  b  c  d
  BMBC(x)  1  1  2  6
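The slide does not spell out the formula behind the table. One definition that reproduces these four values is BMBC(x) = max(1, m-1-last(x)), where last(x) is the index of the last occurrence of x in P, and BMBC(x) = m when x does not occur in P. The sketch below builds the table under that assumption; the name build_bmbc is hypothetical.

```c
#include <limits.h>
#include <string.h>

/* Builds a bad-character shift table for every possible byte value.
 * Assumed definition (chosen because it reproduces the slide's table):
 *   shift[x] = m                       if x does not occur in the pattern
 *   shift[x] = max(1, m - 1 - last(x)) otherwise */
void build_bmbc(const char *pattern, int shift[UCHAR_MAX + 1])
{
    int m = (int)strlen(pattern);
    for (int c = 0; c <= UCHAR_MAX; c++)
        shift[c] = m;
    for (int k = 0; k < m; k++) {
        int s = m - 1 - k;        /* distance from position k to the pattern's end */
        /* later positions overwrite earlier ones, so the last occurrence wins */
        shift[(unsigned char)pattern[k]] = (s > 0) ? s : 1;
    }
}
```

Some presentations leave the final pattern character out of the table; under that convention the entry for b in this example would be 4 instead of 1.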
BM Good Suffix
• Here x denotes the pattern (of length m) and y the text. Assume that a
mismatch occurs between the character x[i]=a of the pattern and the
character y[i+j]=b of the text during an attempt at position j.
• Then x[i+1 .. m-1] = y[i+j+1 .. j+m-1] = u and x[i] ≠ y[i+j]. The
good-suffix shift consists in aligning the segment
y[i+j+1 .. j+m-1] = x[i+1 .. m-1] with its rightmost occurrence in x
that is preceded by a character different from x[i]
BM Good Suffix
• If no such segment exists, the shift
consists in aligning the longest suffix v of y[i+j+1
.. j+m-1] with a matching prefix of x
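Both good-suffix cases can be captured by a single condition on the shift s: every character of the matched suffix that still lies over the pattern after shifting must agree, and the character that lands under the mismatch position, if any, must differ from x[i]. The sketch below searches for the smallest such s directly from this definition. It is cubic in m and meant for checking small examples, not for production use; the name build_bmgs_naive is hypothetical.

```c
#include <string.h>

/* bmgs[i] = smallest shift s >= 1 such that
 *   (a) x[k-s] == x[k] for every k in i+1..m-1 with k-s >= 0, and
 *   (b) x[i-s] != x[i] whenever i-s >= 0.
 * The caller must provide room for strlen(x) ints. */
void build_bmgs_naive(const char *x, int *bmgs)
{
    int m = (int)strlen(x);
    for (int i = 0; i < m; i++) {
        int s;
        for (s = 1; s < m; s++) {
            int ok = 1;
            /* (a) the matched suffix must agree with the shifted pattern */
            for (int k = i + 1; k < m && ok; k++)
                if (k - s >= 0 && x[k - s] != x[k])
                    ok = 0;
            /* (b) the character before the re-occurrence must differ from x[i] */
            if (ok && i - s >= 0 && x[i - s] == x[i])
                ok = 0;
            if (ok)
                break;
        }
        bmgs[i] = s;   /* equals m when no smaller shift qualifies */
    }
}
```

For the pattern "abacab" used earlier this gives the shifts 4 4 4 4 6 1 for mismatch positions i = 0..5.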
Refine BM Shift Function
• BMBC = BM Bad Character Shift
• BMGS = BM Good Suffix Shift
• Shift(x) = max(BMBC, BMGS): after a mismatch, move the pattern by the larger of the two shifts
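A search that takes the larger of the two shifts can be sketched as follows. Two assumptions are made here: the bad-character shift is computed as i - last(c) for a mismatch at pattern position i against text character c (it is ignored when negative, because the good-suffix shift is always at least 1), and the good-suffix table is built naively from its definition. The names bm_search and bm_good_suffix are hypothetical.

```c
#include <limits.h>
#include <stdlib.h>
#include <string.h>

/* Good-suffix shifts computed naively from the definition (O(m^3)). */
static void bm_good_suffix(const char *x, int m, int *gs)
{
    for (int i = 0; i < m; i++) {
        int s;
        for (s = 1; s < m; s++) {
            int ok = 1;
            for (int k = i + 1; k < m && ok; k++)
                if (k - s >= 0 && x[k - s] != x[k])
                    ok = 0;
            if (ok && i - s >= 0 && x[i - s] == x[i])
                ok = 0;
            if (ok)
                break;
        }
        gs[i] = s;
    }
}

/* Returns the index of the first occurrence of pattern in text, or -1. */
int bm_search(const char *text, const char *pattern)
{
    int n = (int)strlen(text);
    int m = (int)strlen(pattern);
    int last[UCHAR_MAX + 1];
    if (m == 0)
        return 0;                 /* assumption: empty pattern matches at 0 */
    if (m > n)
        return -1;
    int *gs = malloc((size_t)m * sizeof *gs);
    if (gs == NULL)
        return -1;                /* out of memory is reported as "no match" here */
    bm_good_suffix(pattern, m, gs);
    for (int c = 0; c <= UCHAR_MAX; c++)
        last[c] = -1;
    for (int k = 0; k < m; k++)
        last[(unsigned char)pattern[k]] = k;

    int result = -1;
    int s = 0;                    /* pattern[0] is aligned with text[s] */
    while (s <= n - m) {
        int i = m - 1;
        while (i >= 0 && pattern[i] == text[s + i])
            i--;                  /* compare right to left */
        if (i < 0) {
            result = s;           /* every character matched */
            break;
        }
        int bc = i - last[(unsigned char)text[s + i]];   /* bad-character shift */
        s += (bc > gs[i]) ? bc : gs[i];                  /* larger of the two */
    }
    free(gs);
    return result;
}
```

For example, bm_search("a pattern matching algorithm", "rithm") returns 23.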
Analysis of BM
• Boyer-Moore worst case running time is
O(nm)
• Best case of Boyer-Moore is O(n/m)
• Boyer-Moore is fast when the alphabet is large,
slow when the alphabet is small.
• In practice, the running time is BM < KMP < BF
External Pattern Matching
KMP & BM
• KMP
▫ Small alphabet
▫ Network Stream
▫ External Disk
• BM
▫ Large alphabet
▫ Faster on average