String Matching Problem
Given a text string T of length n and a pattern string P of length m, the exact string matching problem is to find all occurrences of P in T.
Example: T = AGCTTGA, P = GCT (P occurs in T starting at position 2).
Applications:
* Searching for keywords in a file
* Search engines (like Google and Openfind)
* Database searching (GenBank)
What is pattern matching?
Problem: find an occurrence of a pattern (string) P in a string S, and report the position in S where the match occurs.
Brute Force algorithm
The brute-force pattern matching algorithm compares the pattern P with the text T for each possible shift of P relative to T, until either
* a match is found, or
* all placements of the pattern have been tried.
Brute-force
algorithm brute-force:
  input: an array of characters T (the string to be analyzed), length n
         an array of characters P (the pattern to be searched for), length m
  for i := 0 to n-m do
    for j := 0 to m-1 do
      compare T[i+j] with P[j]
      if not equal, exit the inner loop
    if the inner loop ran to completion, report a match at shift i
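Worst case: O(m*n) character comparisons, e.g. when nearly every shift matches m-1 characters before failing.
Best case: O(n), when every shift fails on its first comparison.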
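The pseudocode above can be sketched in Python as follows. This is a minimal illustration, not a tuned implementation; the function name `brute_force_search` and the choice to return a list of 0-based shifts are assumptions made for the example.

```python
def brute_force_search(text: str, pattern: str) -> list[int]:
    """Return every 0-based shift at which pattern occurs in text."""
    n, m = len(text), len(pattern)
    shifts = []
    for i in range(n - m + 1):              # each possible shift of the pattern
        j = 0
        # Compare left to right; stop at the first mismatch.
        while j < m and text[i + j] == pattern[j]:
            j += 1
        if j == m:                          # all m characters matched
            shifts.append(i)
    return shifts


print(brute_force_search("AGCTTGA", "GCT"))  # [1]
```

The result [1] is the 0-based shift, which corresponds to position 2 when characters are counted from 1.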
Example
Compare each character of p with S: if they match, continue; otherwise shift p one position.
String S: a b c a b a a b c a b a c
Pattern p: a b a a
Step 1: compare p[1] with S[1]
S: a b c a b a a b c a b a c
p: a b a a
Step 2: compare p[2] with S[2]
S: a b c a b a a b c a b a c
p: a b a a
Step 3: compare p[3] with S[3]
S: a b c a b a a b c a b a c
p: a b a a
Mismatch occurs here: S[3] = c, p[3] = a.
Since a mismatch is detected, shift p one position to the right and repeat the procedure of steps 1 to 3 from the new position. In general, whenever a mismatch is detected, shift p one position to the right and repeat the matching procedure.
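The shift-and-retry procedure of this example can be traced with a short Python sketch. The helper name `first_mismatch` is hypothetical, and the code uses 0-based indices internally while printing 1-based pattern positions to match the steps above.

```python
def first_mismatch(text, pattern, shift):
    """Return the 0-based pattern index of the first mismatch at this shift,
    or None if the whole pattern matches there."""
    for j, ch in enumerate(pattern):
        if text[shift + j] != ch:
            return j
    return None


S = "abcabaabcabac"
p = "abaa"
for shift in range(len(S) - len(p) + 1):
    j = first_mismatch(S, p, shift)
    if j is None:
        print(f"shift {shift}: match")
        break                               # stop at the first occurrence
    print(f"shift {shift}: mismatch at p[{j + 1}]")
```

At shift 0 the first mismatch is at p[3], as in step 3; the pattern is then moved one position at a time until it matches.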
The Knuth-Morris-Pratt Algorithm
Knuth, Morris and Pratt proposed a linear-time algorithm for the string matching problem. A matching time of O(n) is achieved by avoiding comparisons with elements of S that have previously been involved in a comparison with some element of the pattern p to be matched; i.e., backtracking on the string S never occurs.
Components of KMP algorithm
The prefix function, π
The prefix function π for a pattern encapsulates knowledge about how the pattern matches against shifts of itself. This information can be used to avoid useless shifts of the pattern p. In other words, it enables avoiding backtracking on the string S.
The KMP Matcher
With string S, pattern p and prefix function π as inputs, it finds the occurrences of p in S and returns the number of shifts of p after which each occurrence is found.
Knuth-Morris-Pratt algorithm
Algorithm Compute-Prefix-Function(P)
  m ← length[P]
  π[1] ← 0
  k ← 0
  for q ← 2 to m
    do while k > 0 and P[k + 1] ≠ P[q]
         do k ← π[k]
       /* leaves the while-loop when k = 0 or P[k + 1] = P[q] */
       if P[k + 1] = P[q]
         then k ← k + 1
       π[q] ← k
  return π
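A possible Python rendering of Compute-Prefix-Function is sketched below. It assumes 0-based indexing, so `pi[q]` here plays the role of π[q + 1] in the pseudocode; the function name and the list return type are choices made for this illustration.

```python
def compute_prefix_function(pattern: str) -> list[int]:
    """pi[q] = length of the longest proper prefix of pattern[:q + 1]
    that is also a suffix of pattern[:q + 1]."""
    m = len(pattern)
    pi = [0] * m
    k = 0                                   # length of the current matched prefix
    for q in range(1, m):
        # Fall back to shorter prefixes until one can be extended.
        while k > 0 and pattern[k] != pattern[q]:
            k = pi[k - 1]
        if pattern[k] == pattern[q]:
            k += 1
        pi[q] = k
    return pi


print(compute_prefix_function("ababababca"))
# [0, 0, 1, 2, 3, 4, 5, 6, 0, 1]
```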
Knuth-Morris-Pratt algorithm
Algorithm KMP-Matcher(T, P)
  n ← length[T]
  m ← length[P]
  π ← Compute-Prefix-Function(P)
  q ← 0
  for i ← 1 to n
    do while q > 0 and P[q + 1] ≠ T[i]
         do q ← π[q]
       if P[q + 1] = T[i]
         then q ← q + 1
       if q = m
         then print "pattern occurs with shift" i − m
              q ← π[q]
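The matcher can be sketched in Python along the same lines. This is an illustrative version under a few assumptions: indices are 0-based, the function name `kmp_search` is made up for the example, matches are collected in a list instead of printed, and an empty pattern is treated as having no occurrences. The prefix table is rebuilt inline so the block stands on its own.

```python
def kmp_search(text: str, pattern: str) -> list[int]:
    """Return every 0-based shift at which pattern occurs in text."""
    m = len(pattern)
    if m == 0:
        return []                           # assumption: empty pattern has no matches

    # Prefix table, 0-based: pi[q] corresponds to π[q + 1] in the pseudocode.
    pi = [0] * m
    k = 0
    for q in range(1, m):
        while k > 0 and pattern[k] != pattern[q]:
            k = pi[k - 1]
        if pattern[k] == pattern[q]:
            k += 1
        pi[q] = k

    shifts = []
    q = 0                                   # number of pattern characters matched
    for i, ch in enumerate(text):           # the text index i never moves backwards
        while q > 0 and pattern[q] != ch:
            q = pi[q - 1]
        if pattern[q] == ch:
            q += 1
        if q == m:                          # full match ending at text[i]
            shifts.append(i - m + 1)
            q = pi[q - 1]                   # keep going to find overlapping matches
    return shifts


print(kmp_search("ababaababababca", "ababababca"))  # [5]
```

With 1-based i, the pseudocode's shift i − m is the same value as the 0-based `i - m + 1` used here.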
Compute prefix function
P = ababababca, T = ababaababababca
π[1] = 0, k = 0
q = 2: P[k + 1] = P[1] = a, P[q] = P[2] = b, P[k + 1] ≠ P[q]; π[q] ← k (π[2] ← 0)
q = 3: P[k + 1] = P[1] = a, P[q] = P[3] = a, P[k + 1] = P[q]; k ← k + 1, π[q] ← k (π[3] ← 1)
k = 1, q = 4: P[k + 1] = P[2] = b, P[q] = P[4] = b, P[k + 1] = P[q]; k ← k + 1, π[q] ← k (π[4] ← 2)
k = 2, q = 5: P[k + 1] = P[3] = a, P[q] = P[5] = a, P[k + 1] = P[q]; k ← k + 1, π[q] ← k (π[5] ← 3)
k = 3, q = 6: P[k + 1] = P[4] = b, P[q] = P[6] = b, P[k + 1] = P[q]; k ← k + 1, π[q] ← k (π[6] ← 4)
k = 4, q = 7: P[k + 1] = P[5] = a, P[q] = P[7] = a, P[k + 1] = P[q]; k ← k + 1, π[q] ← k (π[7] ← 5)
k = 5, q = 8: P[k + 1] = P[6] = b, P[q] = P[8] = b, P[k + 1] = P[q]; k ← k + 1, π[q] ← k (π[8] ← 6)
k = 6, q = 9: P[k + 1] = P[7] = a, P[q] = P[9] = c, P[k + 1] ≠ P[q]; k ← π[k] (k ← π[6] = 4)
              P[k + 1] = P[5] = a, P[q] = P[9] = c, P[k + 1] ≠ P[q]; k ← π[k] (k ← π[4] = 2)
              P[k + 1] = P[3] = a, P[q] = P[9] = c, P[k + 1] ≠ P[q]; k ← π[k] (k ← π[2] = 0)
k = 0, q = 9: P[k + 1] = P[1] = a, P[q] = P[9] = c, P[k + 1] ≠ P[q]; π[q] ← k (π[9] ← 0)
q = 10: P[k + 1] = P[1] = a, P[q] = P[10] = a, P[k + 1] = P[q]; k ← k + 1, π[q] ← k (π[10] ← 1)
After prefix computation, the table is shown below
P = ababababca
i     1  2  3  4  5  6  7  8  9  10
P[i]  a  b  a  b  a  b  a  b  c  a
π[i]  0  0  1  2  3  4  5  6  0  1

π[8] = 6, π[6] = 4, π[4] = 2, π[2] = 0

Each row shows P shifted so that its prefix Pk lines up with the end of P8:

P8  a b a b a b a b  c a
P6      a b a b a b  a b c a
P4          a b a b  a b a b c a
P2              a b  a b a b a b c a
P0                   a b a b a b a b c a
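The chain π[8] = 6, π[6] = 4, π[4] = 2, π[2] = 0 can be reproduced by applying the prefix table repeatedly. The sketch below is illustrative only: `prefix_chain` is a hypothetical helper, the table is copied by hand from the values listed above, and lengths are 1-based (the Python list is 0-based, hence `pi[k - 1]`).

```python
def prefix_chain(pi, q):
    """Follow the prefix table from length q down to 0,
    returning pi[q], pi[pi[q]], ... as 1-based prefix lengths."""
    chain = []
    k = q
    while k > 0:
        k = pi[k - 1]                       # 0-based list, 1-based lengths
        chain.append(k)
    return chain


P = "ababababca"
pi = [0, 0, 1, 2, 3, 4, 5, 6, 0, 1]         # prefix table for P, copied from above

chain = prefix_chain(pi, 8)
print(chain)                                # [6, 4, 2, 0]

# Every prefix length in the chain is also a suffix of P8 = P[:8].
for k in chain:
    assert P[:8].endswith(P[:k])
```

These are the fallback lengths the while-loop tries, in order, when the character after P8 fails to match.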
Another Example for KMP Algorithm
Phase 1: First finish the prefix computation.
Phase 2: Next, the search phase computation.
Annotations from the example figure:
f(4-1)+1 = f(3)+1 = 0+1 = 1
f(13-1)+1 = 4+1 = 5 (matched)