Hashing:

Substring Search

Michael Levin
Department of Computer Science and Engineering
University of California, San Diego

Data Structures Fundamentals


Algorithms and Data Structures
Outline

1 Find Substring in Text

2 Rabin-Karp’s Algorithm

3 Recurrence Equation for Substring Hashes

4 Improving Running Time


Searching for Substring
Given a text T (website, book, Amazon product page) and a string P (word, phrase, sentence), find all occurrences of P in T.

Examples
A specific term in a Wikipedia article
A gene in a genome
Code patterns that reveal files infected by a virus
Substring Notation
Definition
Denote by S[i..j] the substring of string S starting at position i and ending at position j (both inclusive).

Examples
If S =“hashing”, then
S[0..3] =“hash”,
S[4..6] =“ing”,
S[2..5] =“shin”.
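
In a language with half-open slices, the inclusive notation S[i..j] corresponds to a slice from i up to j + 1. A minimal Python sketch of that mapping (the helper name `substring` is an illustration, not part of the lecture):

```python
def substring(s: str, i: int, j: int) -> str:
    """Return S[i..j]: the characters at positions i through j, both inclusive."""
    # Python slices exclude the end index, hence j + 1.
    return s[i:j + 1]


s = "hashing"
print(substring(s, 0, 3))  # hash
print(substring(s, 4, 6))  # ing
print(substring(s, 2, 5))  # shin
```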
Find Substring in String
Input: Strings T and P.
Output: All positions i in T, 0 ≤ i ≤ |T| − |P|, such that T[i..i + |P| − 1] = P.
Naive Algorithm
For each position i from 0 to |T| − |P|, check whether T[i..i + |P| − 1] = P or not.
If yes, append i to the result.
AreEqual(S1, S2)
  if |S1| ≠ |S2|:
    return False
  for i from 0 to |S1| − 1:
    if S1[i] ≠ S2[i]:
      return False
  return True
FindSubstringNaive(T, P)
  positions ← empty list
  for i from 0 to |T| − |P|:
    if AreEqual(T[i..i + |P| − 1], P):
      positions.Append(i)
  return positions
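
The two procedures above translate almost line by line into Python. The sketch below is one possible rendering, not the lecture's reference implementation; the snake_case names are chosen here for illustration. Slicing copies the window, which costs O(|P|) per position and so does not change the asymptotic bound.

```python
def are_equal(s1: str, s2: str) -> bool:
    """Character-by-character comparison, mirroring AreEqual."""
    if len(s1) != len(s2):
        return False
    for a, b in zip(s1, s2):
        if a != b:
            return False
    return True


def find_substring_naive(text: str, pattern: str) -> list[int]:
    """Return every position i such that text[i..i+|P|-1] equals pattern."""
    positions = []
    # range() is empty when the pattern is longer than the text.
    for i in range(len(text) - len(pattern) + 1):
        if are_equal(text[i:i + len(pattern)], pattern):
            positions.append(i)
    return positions


print(find_substring_naive("abracadabra", "abra"))  # [0, 7]
```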
Running Time
Lemma
The running time of FindSubstringNaive(T, P) is O(|T||P|).

Proof
Each AreEqual call is O(|P|).
There are |T| − |P| + 1 calls of AreEqual, which total O((|T| − |P| + 1)|P|) = O(|T||P|).
Bad Example
T = “aaa…aa” (very long)
P = “aaa…ab” (much shorter than T)
For each position i in T from 0 to |T| − |P|, the call to AreEqual has to make all |P| comparisons, because the difference is always in the last character.
Thus, in this case the naive algorithm runs in time Θ(|T||P|).
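
To make the bad case concrete, the following Python sketch counts the character comparisons the naive algorithm performs. The function name `count_comparisons` and the specific lengths (1000 and 10) are arbitrary choices for illustration.

```python
def count_comparisons(text: str, pattern: str) -> int:
    """Count character comparisons made by the naive substring search."""
    comparisons = 0
    for i in range(len(text) - len(pattern) + 1):
        for k in range(len(pattern)):
            comparisons += 1
            if text[i + k] != pattern[k]:
                break  # mismatch found, move on to the next position
    return comparisons


text = "a" * 1000            # T = "aaa...aa"
pattern = "a" * 9 + "b"      # P = "aaa...ab"
# 991 positions, each needing all 10 comparisons.
print(count_comparisons(text, pattern))  # 9910
```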
Rabin-Karp’s Algorithm
Compare P with all substrings S of T of length |P|
Idea: use hashing to make the comparisons faster
Comparing Hashes
If h(P) ≠ h(S), then definitely P ≠ S
If h(P) = h(S), call AreEqual(P, S) to check whether P = S or not
Use the polynomial hash family $\mathcal{P}_p$ with prime p
If P ≠ S, the probability Pr[h(P) = h(S)] of a collision is at most |P|/p for polynomial hashing, which can be made small by choosing a very large prime p
RabinKarp(T, P)
  p ← big prime, x ← random(1, p − 1)
  positions ← empty list
  pHash ← PolyHash(P, p, x)
  for i from 0 to |T| − |P|:
    tHash ← PolyHash(T[i..i + |P| − 1], p, x)
    if pHash ≠ tHash:
      continue
    if AreEqual(T[i..i + |P| − 1], P):
      positions.Append(i)
  return positions
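
A possible Python rendering of this first version of RabinKarp is sketched below. Several details are assumptions made for the example rather than choices stated in the slides: characters are encoded with `ord`, the prime is fixed at 1 000 000 007, and the names `poly_hash` and `rabin_karp_slow` are illustrative. The hash of every window is still recomputed from scratch here.

```python
import random


def poly_hash(s: str, p: int, x: int) -> int:
    """Polynomial hash: sum of ord(s[i]) * x**i, taken modulo p."""
    h = 0
    # Horner's rule, processing the highest power first.
    for ch in reversed(s):
        h = (h * x + ord(ch)) % p
    return h


def rabin_karp_slow(text: str, pattern: str, p: int = 1_000_000_007) -> list[int]:
    """Rabin-Karp with each window hashed independently, O(|T||P|) overall."""
    x = random.randint(1, p - 1)
    positions = []
    p_hash = poly_hash(pattern, p, x)
    for i in range(len(text) - len(pattern) + 1):
        window = text[i:i + len(pattern)]
        if poly_hash(window, p, x) != p_hash:
            continue  # different hashes: the strings certainly differ
        if window == pattern:  # equal hashes: rule out a false alarm
            positions.append(i)
    return positions


print(rabin_karp_slow("abracadabra", "abra"))  # [0, 7]
```

Because every hash match is confirmed by a direct comparison, the result does not depend on the random choice of x; only the running time does.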
False Alarms
A “false alarm” is the event when P is compared with a substring S of T because their hashes match, but P ≠ S.
The probability of a “false alarm” is at most |P|/p
On average, the total number of “false alarms” will be at most (|T| − |P| + 1)|P|/p, which can be made small by selecting p ≫ |T||P|.
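
As a rough numeric illustration of the bound (|T| − |P| + 1)|P|/p, the sketch below evaluates it for two primes. The text and pattern lengths are made-up values, and the function name is hypothetical; 2^61 − 1 is used as an example of a prime far larger than |T||P|.

```python
def expected_false_alarms_bound(text_len: int, pattern_len: int, p: int) -> float:
    """Upper bound (|T| - |P| + 1) * |P| / p on the expected number of false alarms."""
    return (text_len - pattern_len + 1) * pattern_len / p


# |T| = 10**6 and |P| = 10**3, so |T||P| = 10**9.
# With p close to |T||P| the bound is just under 1 expected false alarm.
print(expected_false_alarms_bound(10**6, 10**3, 10**9 + 7))
# With p much larger than |T||P| the bound becomes negligible.
print(expected_false_alarms_bound(10**6, 10**3, 2**61 - 1))
```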
Running Time without AreEqual
h(P) is computed in O(|P|)
h(T[i..i + |P| − 1]) is computed in O(|P|), |T| − |P| + 1 times
O(|P|) + O((|T| − |P| + 1)|P|) = O(|T||P|)
AreEqual Running Time
AreEqual is computed in O(|P|)
AreEqual is called only when h(P) = h(T[i..i + |P| − 1]), meaning that either an occurrence of P is found or a “false alarm” happened
By selecting p ≫ |T||P| we make the number of “false alarms” negligible
Total Running Time
If P is found q times in T, then the total time spent in AreEqual is on average O((q + (|T| − |P| + 1)|P|/p)|P|) = O(q|P|) for p ≫ |T||P|
Total running time is on average O(|T||P|) + O(q|P|) = O(|T||P|) as q ≤ |T|
Analysis
O(|T||P|) is the same as the running time of the naive algorithm, but it can be improved!
The second summand O(q|P|) is unavoidable as we need to check each of the q occurrences of P in T
The first summand O(|T||P|) is so big because we compute the hash of each substring of T separately
This can be optimized (see the next video)
Idea
Polynomial hash:
$$h(S) = \sum_{i=0}^{|S|-1} S[i]\,x^{i} \bmod p$$

Idea: polynomial hashes of two consecutive substrings of T are very similar
For each i, denote h(T[i..i + |P| − 1]) by H[i]
Consecutive substrings
T = b e a c h
encode(T) = 1 4 0 2 7, |P| = 3
Multiplying the terms 0 and 2x of h("ach") by x gives the terms 0x and 2x^2 of h("eac"):
$$
\begin{aligned}
H[2] &= h(\text{"ach"}) = 0 + 2x + 7x^2 \\
H[1] &= h(\text{"eac"}) = 4 + 0x + 2x^2 \\
     &= 4 + x(0 + 2x) \\
     &= 4 + x(0 + 2x + 7x^2) - 7x^3 \\
     &= xH[2] + 4 - 7x^3
\end{aligned}
$$
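
The identity H[1] = xH[2] + 4 − 7x^3 can be checked numerically. In the sketch below the values x = 31 and p = 1 000 000 007 are arbitrary illustrative choices, and `poly_hash_codes` is a hypothetical helper that hashes a list of already-encoded characters.

```python
def poly_hash_codes(codes: list[int], x: int, p: int) -> int:
    """Polynomial hash of a list of character codes: sum of codes[k] * x**k mod p."""
    return sum(c * pow(x, k, p) for k, c in enumerate(codes)) % p


p, x = 1_000_000_007, 31      # illustrative values, not from the slides
T = [1, 4, 0, 2, 7]           # encode("beach") with a = 0, b = 1, ...

H2 = poly_hash_codes(T[2:5], x, p)   # h("ach") = 0 + 2x + 7x^2
H1 = poly_hash_codes(T[1:4], x, p)   # h("eac") = 4 + 0x + 2x^2

# Recurrence from the example: H[1] = x * H[2] + T[1] - T[4] * x^3 (mod p)
from_recurrence = (x * H2 + T[1] - T[4] * pow(x, 3, p)) % p

print(H2, H1, from_recurrence)  # 6789 1926 1926
```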
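Recurrence Equation for H[i]

$$H[i + 1] = \sum_{j=i+1}^{i+|P|} T[j]\,x^{j-i-1} \bmod p$$

$$
\begin{aligned}
H[i] &= \sum_{j=i}^{i+|P|-1} T[j]\,x^{j-i} \bmod p \\
&= \left(\sum_{j=i+1}^{i+|P|} T[j]\,x^{j-i} + T[i] - T[i+|P|]\,x^{|P|}\right) \bmod p \\
&= \left(x \sum_{j=i+1}^{i+|P|} T[j]\,x^{j-i-1} + T[i] - T[i+|P|]\,x^{|P|}\right) \bmod p
\end{aligned}
$$

$$H[i] = \left(x\,H[i+1] + T[i] - T[i+|P|]\,x^{|P|}\right) \bmod p$$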


Using Recurrence Equation
$$H[i] = \left(x\,H[i+1] + T[i] - T[i+|P|]\,x^{|P|}\right) \bmod p$$

$x^{|P|}$ can be computed once and saved
Using this recurrence equation, H[i] can be computed in O(1) given H[i + 1] and $x^{|P|}$
See the next video to learn how this improves the running time of Rabin-Karp
Use Precomputation
Use the recurrence equation to precompute all hashes of substrings of T of length |P|
Then proceed the same way as the original Rabin-Karp implementation
PrecomputeHashes(T, |P|, p, x)
  H ← array of length |T| − |P| + 1
  S ← T[|T| − |P|..|T| − 1]
  H[|T| − |P|] ← PolyHash(S, p, x)
  y ← 1
  for i from 1 to |P|:
    y ← (y · x) mod p
  for i from |T| − |P| − 1 down to 0:
    H[i] ← (xH[i + 1] + T[i] − yT[i + |P|]) mod p
  return H

Running time: O(|P| + |P| + |T| − |P|) = O(|T| + |P|)


Precomputing H
PolyHash is called once: O(|P|)
$x^{|P|}$ is computed in O(|P|)
All values of H are computed in O(|T| − |P|)
Total precomputation time: O(|T| + |P|)
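
PrecomputeHashes could be written in Python roughly as follows. This is a sketch under a few assumptions: characters are encoded with `ord`, the built-in three-argument `pow` replaces the explicit loop that computes y, and an empty list is returned when the pattern is longer than the text (a case the pseudocode does not cover). The usage lines compare every precomputed value with a hash computed directly.

```python
def poly_hash(s: str, p: int, x: int) -> int:
    """Polynomial hash: sum of ord(s[i]) * x**i, taken modulo p."""
    h = 0
    for ch in reversed(s):
        h = (h * x + ord(ch)) % p
    return h


def precompute_hashes(text: str, pattern_len: int, p: int, x: int) -> list[int]:
    """Return H where H[i] is the hash of text[i..i+pattern_len-1]."""
    last = len(text) - pattern_len
    if last < 0:
        return []  # no window of that length exists (assumed behaviour)
    H = [0] * (last + 1)
    H[last] = poly_hash(text[last:], p, x)   # hash of the final window
    y = pow(x, pattern_len, p)               # x^|P| mod p, computed once
    for i in range(last - 1, -1, -1):
        # Python's % always yields a value in [0, p), even if the
        # intermediate difference is negative.
        H[i] = (x * H[i + 1] + ord(text[i]) - y * ord(text[i + pattern_len])) % p
    return H


text, m, p, x = "beachbeach", 3, 1_000_000_007, 263
H = precompute_hashes(text, m, p, x)
print(all(H[i] == poly_hash(text[i:i + m], p, x) for i in range(len(H))))  # True
```

In languages where the remainder of a negative number can be negative (C, C++, Java), the subtraction would need an extra correction such as adding p before taking the remainder.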
RabinKarp(T, P)
  p ← big prime, x ← random(1, p − 1)
  positions ← empty list
  pHash ← PolyHash(P, p, x)
  H ← PrecomputeHashes(T, |P|, p, x)
  for i from 0 to |T| − |P|:
    if pHash ≠ H[i]:
      continue
    if AreEqual(T[i..i + |P| − 1], P):
      positions.Append(i)
  return positions
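
Putting the pieces together, the improved RabinKarp might look like this in Python. As before, this is a sketch: the `ord` encoding, the default prime 1 000 000 007, and the function names are assumptions made for the example. The helper functions are repeated so that the block runs on its own.

```python
import random


def poly_hash(s: str, p: int, x: int) -> int:
    """Polynomial hash: sum of ord(s[i]) * x**i, taken modulo p."""
    h = 0
    for ch in reversed(s):
        h = (h * x + ord(ch)) % p
    return h


def precompute_hashes(text: str, m: int, p: int, x: int) -> list[int]:
    """Hashes of all length-m windows of text, via the recurrence for H[i]."""
    last = len(text) - m
    H = [0] * (last + 1)
    H[last] = poly_hash(text[last:], p, x)
    y = pow(x, m, p)
    for i in range(last - 1, -1, -1):
        H[i] = (x * H[i + 1] + ord(text[i]) - y * ord(text[i + m])) % p
    return H


def rabin_karp(text: str, pattern: str, p: int = 1_000_000_007) -> list[int]:
    """Return all positions where pattern occurs in text."""
    if len(pattern) > len(text):
        return []
    x = random.randint(1, p - 1)
    p_hash = poly_hash(pattern, p, x)
    H = precompute_hashes(text, len(pattern), p, x)
    positions = []
    for i in range(len(text) - len(pattern) + 1):
        if H[i] != p_hash:
            continue  # O(1) rejection using the precomputed hash
        if text[i:i + len(pattern)] == pattern:  # confirm, in case of a false alarm
            positions.append(i)
    return positions


print(rabin_karp("abracadabra", "abra"))  # [0, 7]
print(rabin_karp("beach", "ach"))         # [2]
```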
Improved Running Time
h(P) is computed in O(|P|)
PrecomputeHashes runs in O(|T| + |P|)
Total time spent in AreEqual is O(q|P|) on average (for large enough prime p), where q is the number of occurrences of P in T
Total running time on average is O(|T| + (q + 1)|P|)
Usually q is small, so this is much less than O(|T||P|)
Conclusion
Hash tables are useful for storing Sets and Maps
Possible to search and modify hash tables in O(1) on average!
Must use good hash families and randomization
Hashes are also useful while working with strings and texts
There are many more applications, including blockchain (see the next video)!
