Data Structure and Algorithms [CO2003]
Chapter 9 - Hash
Lecturer: Duc Dung Nguyen, PhD.
Contact: [email protected]
Faculty of Computer Science and Engineering
Hochiminh city University of Technology
Contents
1. Basic concepts
2. Hash functions
3. Collision resolution
Outcomes
• L.O.5.1 - Depict the following concepts: hashing table, key, collision, and collision
resolution.
• L.O.5.2 - Describe hashing functions using pseudocode and give examples to show their
algorithms.
• L.O.5.3 - Describe collision resolution methods using pseudocode and give examples to
show their algorithms.
• L.O.5.4 - Implement hash tables using C/C++.
• L.O.5.5 - Analyze the complexity and develop experiments (programs) to evaluate the methods
supplied for hash tables.
• L.O.1.2 - Analyze algorithms and use Big-O notation to characterize the computational
complexity of algorithms composed by using the following control structures: sequence,
branching, and iteration (not recursion).
Basic concepts
• Sequential search: O(n)
• Binary search: O(log₂ n)
→ Both require several key comparisons before the target is found.
Search complexity (number of key comparisons):
Size         Binary    Sequential (average)    Sequential (worst case)
16           4         8                       16
50           6         25                      50
256          8         128                     256
1,000        10        500                     1,000
10,000       14        5,000                   10,000
100,000      17        50,000                  100,000
1,000,000    20        500,000                 1,000,000
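The figures in the table can be reproduced with a few lines of C++. This is a minimal sketch, assuming the binary column is ⌈log₂ n⌉ and the sequential average is n/2; the function names are illustrative and not part of the course material.

```cpp
#include <cassert>
#include <cstdint>

// Smallest b such that 2^b >= n, i.e. ceil(log2 n): the binary-search column.
int binarySearchComparisons(std::uint64_t n) {
    assert(n > 0);
    int b = 0;
    std::uint64_t reach = 1;  // number of items that b halvings can cover
    while (reach < n) {
        reach *= 2;
        ++b;
    }
    return b;
}

// Average and worst-case comparison counts for sequential search over n items.
std::uint64_t sequentialAverage(std::uint64_t n) { return n / 2; }
std::uint64_t sequentialWorst(std::uint64_t n) { return n; }
```

For example, binarySearchComparisons(1000000) gives 20, while sequentialAverage(1000000) gives 500000.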
Is there a search algorithm whose complexity is O(1)?
YES
Figure 1: Each key has only one address
• Home address: address produced by a hash function.
• Prime area: memory that contains all the home addresses.
• Synonyms: a set of keys that hash to the same location.
• Collision: the location of the data to be inserted is already occupied by a synonym.
• Ideal hashing:
  • No location collision
  • Compact address space
Hash functions
• Direct hashing
• Modulo division
• Digit extraction
• Mid-square
• Folding
• Rotation
• Pseudo-random
Direct Hashing
The address is the key itself:
hash(Key) = Key
• Advantage: there is no collision.
• Disadvantage: the address space (storage size) is as large as the key space.
Modulo division
Address = Key mod listSize
• Fewer collisions if listSize is a prime number.
• Example:
  A numbering system handles 1,000,000 employees.
  The data space stores up to 300 employees, so listSize = 307 (a prime) is chosen.
  hash(121267) = 121267 mod 307 = 2
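The modulo-division example can be checked with a short C++ function. This is a minimal sketch, assuming non-negative integer keys; the name hashModulo is illustrative.

```cpp
#include <cassert>

// Modulo-division hash: maps a non-negative key to an address in [0, listSize).
int hashModulo(long long key, int listSize) {
    assert(key >= 0 && listSize > 0);
    return static_cast<int>(key % listSize);
}
```

Calling hashModulo(121267, 307) returns 2, matching the slide.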
Digit extraction
Address = selected digits from Key
Example (selecting the first, third, and fourth digits):
379452 → 394
121267 → 112
378845 → 388
160252 → 102
045128 → 051
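The examples above all select the first, third, and fourth digits of the key. A minimal C++ sketch of that selection follows; it assumes the key is held as a string so that leading zeros survive, and the function name extractDigits is illustrative.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Builds the address from the digits at the given 1-based positions of the key.
// The key is kept as a string so that leading zeros (e.g. "045128") are preserved.
std::string extractDigits(const std::string& key, const std::vector<int>& positions) {
    std::string address;
    for (int p : positions) {
        assert(p >= 1 && p <= static_cast<int>(key.size()));
        address += key[p - 1];
    }
    return address;
}
```

For instance, extractDigits("045128", {1, 3, 4}) yields "051".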
Mid-square
Address = middle digits of Key²
Example:
9452 * 9452 = 89340304 → 3403
• Disadvantage: Key² can be too large to store and compute conveniently.
• Variation: square only a portion of the key.
Example:
379452: 379 * 379 = 143641 → 364
121267: 121 * 121 = 014641 → 464
045128: 045 * 045 = 002025 → 202
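Both the full mid-square example and the variation can be expressed with one helper. This is a minimal sketch, assuming the square is left-padded with zeros to a fixed width and that the caller states which digits count as the "middle"; the name midSquare and its parameters are illustrative.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Squares `part` and returns `count` digits of the square, starting at 0-based
// position `start`, after left-padding the square with zeros to `width` digits.
std::string midSquare(unsigned long long part, std::size_t width,
                      std::size_t start, std::size_t count) {
    std::string sq = std::to_string(part * part);
    if (sq.size() < width) sq = std::string(width - sq.size(), '0') + sq;
    assert(start + count <= sq.size());
    return sq.substr(start, count);
}
```

With these assumptions, midSquare(9452, 8, 2, 4) gives "3403", and midSquare(121, 6, 2, 3) gives "464" for the first three digits of key 121267.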
Folding
The key is divided into parts whose size matches the address size.
Example:
Key = 123|456|789
Fold shift (the parts are added as they are; the carry digit is discarded):
123 + 456 + 789 = 1368
→ 368
Fold boundary (the first and last parts are reversed before adding):
321 + 456 + 987 = 1764
→ 764
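The two folding variants can be sketched in C++ as follows. This is an illustration under stated assumptions: the key is a digit string whose length is a multiple of the part size, and fold boundary reverses only the first and last parts, as in the three-part example above. All function names are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Splits the key into consecutive parts of partSize digits.
std::vector<std::string> splitKey(const std::string& key, std::size_t partSize) {
    assert(!key.empty() && partSize > 0 && key.size() % partSize == 0);
    std::vector<std::string> parts;
    for (std::size_t i = 0; i < key.size(); i += partSize)
        parts.push_back(key.substr(i, partSize));
    return parts;
}

// Adds the parts and keeps the low-order partSize digits of the sum.
int sumAndTruncate(const std::vector<std::string>& parts, std::size_t partSize) {
    int sum = 0, modulus = 1;
    for (std::size_t i = 0; i < partSize; ++i) modulus *= 10;
    for (const std::string& p : parts) sum += std::stoi(p);
    return sum % modulus;
}

// Fold shift: the parts are added unchanged.
int foldShift(const std::string& key, std::size_t partSize) {
    return sumAndTruncate(splitKey(key, partSize), partSize);
}

// Fold boundary: the first and last parts are reversed before adding.
int foldBoundary(const std::string& key, std::size_t partSize) {
    std::vector<std::string> parts = splitKey(key, partSize);
    std::reverse(parts.front().begin(), parts.front().end());
    if (parts.size() > 1)
        std::reverse(parts.back().begin(), parts.back().end());
    return sumAndTruncate(parts, partSize);
}
```

For the key "123456789" with three-digit parts, foldShift returns 368 and foldBoundary returns 764.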
Rotation
• Hashing keys that are identical except for the last character may create synonyms.
• The key is rotated before hashing: the last character is moved to the front.
original key    rotated key
600101          160010
600102          260010
600103          360010
600104          460010
600105          560010
• Used in combination with fold shift (two-digit parts in this example):
original key     rotated key
600101 → 62      160010 → 26
600102 → 63      260010 → 36
600103 → 64      360010 → 46
600104 → 65      460010 → 56
600105 → 66      560010 → 66
The rotated keys spread the data more evenly across the address space.
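The table can be reproduced with a rotation step followed by a two-digit fold shift. This is a minimal sketch, assuming keys are digit strings of even length; rotateRight and foldShift2 are illustrative names.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Moves the last character of the key to the front: "600101" becomes "160010".
std::string rotateRight(const std::string& key) {
    assert(!key.empty());
    return key.back() + key.substr(0, key.size() - 1);
}

// Fold shift with two-digit parts, keeping the low-order two digits of the sum.
int foldShift2(const std::string& key) {
    assert(key.size() % 2 == 0);
    int sum = 0;
    for (std::size_t i = 0; i < key.size(); i += 2)
        sum += std::stoi(key.substr(i, 2));
    return sum % 100;
}
```

Without rotation, foldShift2("600101") is 62 and foldShift2("600102") is 63; after rotation the same keys map to 26 and 36.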
Pseudo-random
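The address is computed from the key with a linear pseudo-random formula:
Address = (a * Key + c) mod listSize
For maximum efficiency, a and c should be prime numbers.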
Example:
Key = 121267
a = 17
c = 7
listSize = 307
Address = (17 * 121267 + 7) mod 307
        = (2061539 + 7) mod 307
        = 2061546 mod 307
        = 41
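The worked example corresponds to the following C++ function. This is a minimal sketch, assuming non-negative keys and coefficients small enough that a * key + c fits in a 64-bit integer; the name hashPseudoRandom is illustrative.

```cpp
#include <cassert>

// Pseudo-random hash: address = (a * key + c) mod listSize.
long long hashPseudoRandom(long long key, long long a, long long c, long long listSize) {
    assert(key >= 0 && a >= 0 && c >= 0 && listSize > 0);
    return (a * key + c) % listSize;
}
```

Calling hashPseudoRandom(121267, 17, 7, 307) returns 41, as computed above.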
Collision resolution
• Except for direct hashing, none of the hash functions above is a one-to-one mapping
→ Collision resolution methods are required
• Any collision resolution method can be used with any hash function
• Open addressing
• Linked list resolution
• Bucket hashing
Open addressing
When a collision occurs, the prime area is searched for an unoccupied element in which to place the new data.
Hash function:
h : U → {0, 1, 2, ..., m − 1}
(set of keys → addresses)
Hash and probe function:
hp : U × {0, 1, 2, ..., m − 1} → {0, 1, 2, ..., m − 1}
(set of keys × probe numbers → addresses)
Algorithm hashInsert(ref T <array>, val k <key>)
Inserts key k into table T, which has m elements.
    i = 0
    while i < m do
        j = hp(k, i)
        if T[j] = nil then
            T[j] = k
            return j
        else
            i = i + 1
        end
    end
    return error: "hash table overflow"
End hashInsert
Algorithm hashSearch(val T <array>, val k <key>)
Searches for key k in table T, which has m elements.
    i = 0
    while i < m do
        j = hp(k, i)
        if T[j] = k then
            return j
        else if T[j] = nil then
            return nil
        else
            i = i + 1
        end
    end
    return nil
End hashSearch
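The two algorithms above translate fairly directly into C++. The sketch below is one possible rendering, not the course's reference implementation: it assumes integer keys, represents nil with a separate used flag, returns -1 instead of an error or nil, and takes the probe function hp(k, i) as a parameter. The type name OpenTable is hypothetical.

```cpp
#include <cassert>
#include <functional>
#include <vector>

struct OpenTable {
    std::vector<long long> slot;            // stored keys
    std::vector<bool> used;                 // used[j] is false where the pseudocode has nil
    std::function<int(long long, int)> hp;  // probe function hp(k, i)

    OpenTable(int m, std::function<int(long long, int)> probe)
        : slot(m, 0), used(m, false), hp(probe) {
        assert(m > 0);
    }

    // hashInsert: returns the address where k was stored, or -1 on overflow.
    int insert(long long k) {
        int m = static_cast<int>(slot.size());
        for (int i = 0; i < m; ++i) {
            int j = hp(k, i);
            if (!used[j]) {
                slot[j] = k;
                used[j] = true;
                return j;
            }
        }
        return -1;  // hash table overflow
    }

    // hashSearch: returns the address of k, or -1 if k is not in the table.
    int search(long long k) const {
        int m = static_cast<int>(slot.size());
        for (int i = 0; i < m; ++i) {
            int j = hp(k, i);
            if (used[j] && slot[j] == k) return j;
            if (!used[j]) return -1;  // an empty slot ends the probe sequence
        }
        return -1;
    }
};
```

For example, with m = 7 and the probe function (k mod 7 + i) mod 7, inserting 10 and then 17 stores them at addresses 3 and 4, and a search for 24 stops at the empty address 5.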
Open Addressing
There are different methods:
• Linear probing
• Quadratic probing
• Double hashing
• Key offset
Linear Probing
• When a home address is occupied, go to the next address (the current address + 1):
hp(k, i) = (h(k) + i) mod m
• Advantages:
  • quite simple to implement
  • data tend to remain near their home address (significant for disk addresses)
• Disadvantages:
  • produces primary clustering
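A small experiment makes primary clustering visible. This is a minimal sketch, assuming h(k) = k mod m as the hash function (the slides leave h unspecified); linearProbe and placeAll are illustrative names.

```cpp
#include <cassert>
#include <vector>

// Linear probing: hp(k, i) = (h(k) + i) mod m, with h(k) = k mod m assumed.
int linearProbe(long long k, int i, int m) {
    assert(k >= 0 && i >= 0 && m > 0);
    return static_cast<int>((k % m + i) % m);
}

// Inserts each key with linear probing and returns the address it ended up at.
std::vector<int> placeAll(const std::vector<long long>& keys, int m) {
    std::vector<bool> used(m, false);
    std::vector<int> address;
    for (long long k : keys) {
        int placed = -1;
        for (int i = 0; i < m && placed < 0; ++i) {
            int j = linearProbe(k, i, m);
            if (!used[j]) {
                used[j] = true;
                placed = j;
            }
        }
        address.push_back(placed);  // -1 means the table was full
    }
    return address;
}
```

With m = 10, the keys 3, 13, 23, and 4 end up at addresses 3, 4, 5, and 6. Key 4 is not a synonym of the others, yet it is pushed to the end of the cluster that they formed.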
Quadratic Probing
• The address increment is the collision probe number squared:
hp(k, i) = (h(k) + i²) mod m
• Advantages:
  • works much better than linear probing because it avoids primary clustering
• Disadvantages:
  • time required to square the probe number
  • produces secondary clustering: synonyms follow the same probe sequence
    h(k₁) = h(k₂) → hp(k₁, i) = hp(k₂, i)
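The probe formula and the secondary-clustering property can be checked with a short function. This is a minimal sketch, assuming h(k) = k mod m; the name quadraticProbe is illustrative.

```cpp
#include <cassert>

// Quadratic probing: hp(k, i) = (h(k) + i^2) mod m, with h(k) = k mod m assumed.
int quadraticProbe(long long k, int i, int m) {
    assert(k >= 0 && i >= 0 && m > 0);
    long long step = static_cast<long long>(i) * i;
    return static_cast<int>((k % m + step) % m);
}
```

With m = 11, key 3 probes addresses 3, 4, 7, 1 for i = 0, 1, 2, 3. Its synonym 14 probes exactly the same addresses, which is the secondary clustering described above.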
Double Hashing
• Using two hash functions:
hp(k, i) = (h₁(k) + i · h₂(k)) mod m
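The slides do not fix h₁ and h₂, so the sketch below picks two common choices purely for illustration: h₁(k) = k mod m and h₂(k) = 1 + (k mod (m − 1)), which keeps the step non-zero. The function names are hypothetical.

```cpp
#include <cassert>

// Illustrative hash functions; they are assumptions, not taken from the slides.
int h1(long long k, int m) { return static_cast<int>(k % m); }
int h2(long long k, int m) { return 1 + static_cast<int>(k % (m - 1)); }

// Double hashing: hp(k, i) = (h1(k) + i * h2(k)) mod m.
int doubleHashProbe(long long k, int i, int m) {
    assert(k >= 0 && i >= 0 && m > 1);
    return static_cast<int>((h1(k, m) + static_cast<long long>(i) * h2(k, m)) % m);
}
```

With m = 13, keys 14 and 27 share the home address 1, but their next probes are 4 and 5 respectively, so synonyms no longer follow the same probe sequence.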
Key Offset
• The new address is a function of the collision address and the key.
offset = ⌊key / listSize⌋
newAddress = (collisionAddress + offset) mod listSize
hp(k, i) = (hp(k, i − 1) + ⌊k/m⌋) mod m
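One step of the key-offset rule can be written as a small C++ function. This is a minimal sketch, assuming non-negative integer keys; the name keyOffsetNext and the sample key are illustrative.

```cpp
#include <cassert>

// Key offset: the step is the integer quotient key / listSize.
int keyOffsetNext(long long key, int collisionAddress, int listSize) {
    assert(key >= 0 && collisionAddress >= 0 && listSize > 0);
    long long offset = key / listSize;
    return static_cast<int>((collisionAddress + offset) % listSize);
}
```

For key 166702 and listSize 307, the home address is 166702 mod 307 = 1 and the offset is 543, so keyOffsetNext(166702, 1, 307) returns (1 + 543) mod 307 = 237. Note that when the quotient is a multiple of listSize (for example, any key smaller than listSize), the address does not change, so a practical implementation would need a fallback step.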
Open addressing
Hash and probe function:
hp : U × {0, 1, 2, ..., m − 1} → {0, 1, 2, ..., m − 1}
(set of keys × probe numbers → addresses)
For every key k, the probe sequence {hp(k, 0), hp(k, 1), ..., hp(k, m − 1)} should be a permutation of {0, 1, ..., m − 1}, so that every address is eventually probed.
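Whether a given probe function satisfies the permutation property for a key can be tested directly. This is a minimal sketch; the helper name isPermutation is hypothetical, and the probe function is passed in as a parameter.

```cpp
#include <cassert>
#include <functional>
#include <vector>

// Returns true if hp(k, 0), ..., hp(k, m-1) visits every address 0..m-1 exactly once.
bool isPermutation(long long k, int m, const std::function<int(long long, int)>& hp) {
    assert(m > 0);
    std::vector<bool> seen(m, false);
    for (int i = 0; i < m; ++i) {
        int j = hp(k, i);
        if (j < 0 || j >= m || seen[j]) return false;  // out of range or revisited
        seen[j] = true;
    }
    return true;
}
```

Linear probing passes this check for any m. In contrast, (h(k) + i²) mod m with m = 8 revisits addresses, so the quadratic formula alone does not guarantee a permutation.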
Linked List Resolution
• Major disadvantage of open addressing: each collision resolution increases the probability
of future collisions.
→ Use linked lists to store synonyms
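Linked list resolution can be sketched in C++ with one list per prime-area address. This is an illustration only: it assumes non-negative integer keys and a modulo-division hash, and the class name ChainedTable is hypothetical.

```cpp
#include <cassert>
#include <forward_list>
#include <vector>

// Hash table with linked list resolution: each address heads a list of synonyms.
class ChainedTable {
public:
    explicit ChainedTable(int m) : chains(m) { assert(m > 0); }

    // Adds k to the list at its home address unless it is already stored.
    void insert(long long k) {
        if (!contains(k)) chains[home(k)].push_front(k);
    }

    // Walks only the list of synonyms at the home address of k.
    bool contains(long long k) const {
        for (long long stored : chains[home(k)])
            if (stored == k) return true;
        return false;
    }

private:
    // Modulo-division hash, assumed here for illustration.
    int home(long long k) const {
        assert(k >= 0);
        return static_cast<int>(k % static_cast<long long>(chains.size()));
    }

    std::vector<std::forward_list<long long>> chains;
};
```

With m = 7, the synonyms 10, 17, and 24 all go into the list at address 3 and never occupy the home addresses of other keys.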
Bucket hashing
• Data are hashed to buckets, each of which can hold multiple pieces of data.
• Each bucket has an address, and collisions are postponed until the bucket is full.
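The idea of postponing collisions can be sketched with a table of fixed-capacity buckets. This is a minimal illustration, assuming non-negative integer keys and a modulo-division hash; the class name BucketTable is hypothetical, and what happens once a bucket is full is left to a separate collision resolution method.

```cpp
#include <cassert>
#include <vector>

// Bucket hashing: each address holds up to bucketSize keys.
class BucketTable {
public:
    BucketTable(int buckets, int bucketSize)
        : table(buckets), capacity(bucketSize) {
        assert(buckets > 0 && bucketSize > 0);
    }

    // Returns true if k was stored in its home bucket, false if that bucket is
    // full (a collision that still has to be resolved by another method).
    bool insert(long long k) {
        assert(k >= 0);
        std::vector<long long>& b = table[k % static_cast<long long>(table.size())];
        if (static_cast<int>(b.size()) >= capacity) return false;
        b.push_back(k);
        return true;
    }

private:
    std::vector<std::vector<long long>> table;
    int capacity;
};
```

With 5 buckets of capacity 2, the synonyms 1 and 6 share bucket 1 without a collision; only the third synonym, 11, finds the bucket full.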