Data Structures and Algorithms 2
Prof. Ahmed Guessoum
The National Higher School of AI
Chapter 5
Hashing
Motivating Example
We want to store a list whose elements
are integers between 1 and 5
We will define an array of size 5, and if
the list has element j, then j is stored in
A[j-1], otherwise A[j-1] contains 0.
The complexity of the find operation is O(1)
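A minimal C++ sketch of this direct-addressing idea (class and member names are illustrative, not part of the course code):

#include <array>

// Direct-address table for integers in the range 1..5:
// element j is stored in A[j-1]; 0 marks an empty slot.
class DirectTable
{
  public:
    void insert( int j )     { A[ j - 1 ] = j; }
    void remove( int j )     { A[ j - 1 ] = 0; }
    bool find( int j ) const { return A[ j - 1 ] != 0; }   // O(1)

  private:
    std::array<int,5> A{ };   // value-initialized: all slots start at 0
};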
The space used for storage is called a "hash table".
The ideal hash table data structure is simply an array of some
fixed size, TableSize, containing the data items.
A search for an item is performed on some part (i.e.
data member) of the item, called the key.
For example, an item could consist of a string that
serves as the key (for instance, a name that is part of a
large employee structure) plus additional data
members.
The common convention is to have the table run from 0
to TableSize − 1.
Hashing
Each key is mapped into some number in the range 0 to
TableSize − 1.
The mapping is called a hash function, h, which ideally is
simple to compute and
maps any two distinct keys to different cells.
Since there is a finite number of cells and a very large
supply of keys, this is clearly impossible, so
we seek a hash function that
distributes the keys evenly among the cells.
Hash Functions
Suppose that the hash table H has size M.
There is a hash function h which maps each key to an
integer p between 0 and M − 1, and the
element is placed in position p in the hash
table.
The hash value for key j is h(j).
If h(j) = k, then the element is added to H[k],
i.e. at position k in H.
Example of an ideal hash table

    Khalid   25000
    Tariq    31250
    Aicha    27500
    Asma     28200

How to choose a hash function?
How to decide on the table size?
What do we do with collisions?
Choice of a Hash Function
If the input keys are integers, then Key mod TableSize
is generally a reasonable strategy, unless Key happens
to have some undesirable properties.
One has to be careful in the design of the hash
function.
E.g., suppose TableSize = 10 and the keys all end in
zero; then the standard hash function above is clearly a bad
choice!
It is often a good idea to ensure that the table size is
prime.
When the input keys are random integers, this function is
not only very simple to compute but also distributes the
keys evenly.
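A one-line sketch of this strategy (assuming non-negative integer keys and a prime table size chosen by the caller):

// Key mod TableSize: reasonable for (random, non-negative) integer keys,
// ideally with a prime table size.
int hash( int key, int tableSize )
{
    return key % tableSize;
}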
Choosing a hash function
Usually, the keys are strings; in this case, the hash
function needs to be chosen carefully.
One option is to add up the ASCII values of the
characters in the string.
Consider Example 1 of a hash function:

int hash( const string & key, int tableSize )
{
    int hashVal = 0;

    for( char ch : key )
        hashVal += ch;

    return hashVal % tableSize;
}
The previous hash function is simple to implement and
computes an answer quickly.
However, if the table size is large, the function does not
distribute the keys well (fairly evenly).
E.g., suppose that TableSize = 10,007 (a prime number).
Suppose all the keys are eight or fewer characters long.
Since an ASCII character has an integer value <= 127,
the values produced by the hash function are between 0
and 1,016 (which is 127 × 8).
This is clearly not an even distribution over the hash
table! (About 90% of the table will never be used!)
Example 2 of a hash function
int hash( const string & key, int tableSize )
{
    return ( key[ 0 ] + 27 * key[ 1 ] + 729 * key[ 2 ] ) % tableSize;
}
27 is the number of English letters plus the blank character; 729 is 27².
This hash function is easy to compute.
It examines only the first three characters.
If characters are random and table size is 10,007, as before,
then we would expect a reasonably equitable distribution.
In fact, examining a dictionary shows that there are only 2,851
distinct combinations of the first three characters, not
17,576 (= 26³). Even if none of these combinations collided, only
28% of the table would actually be hashed to.
Example 3 of a hash function
unsigned int hash( const string & key, int tableSize )
{
    unsigned int hashVal = 0;

    for( char ch : key )
        hashVal = 37 * hashVal + ch;

    return hashVal % tableSize;
}
Involves all characters in the key and can generally be
expected to distribute well: the code computes a polynomial
function of 37 (evaluated by Horner's rule), i.e.
key[0]·37^(KeySize−1) + key[1]·37^(KeySize−2) + … + key[KeySize−1],
and brings the result into the table range.
The hash function takes advantage of the fact that
overflow is allowed and uses unsigned int to avoid
introducing a negative number.
This hash function gives a reasonable distribution over the
table, though not necessarily the best.
It does have the merit of extreme simplicity and is
reasonably fast.
If the keys are very long, the hash function will take too
long to compute.
A common practice in this case is not to use all the
characters.
E.g., for a street-address key: use a couple of characters
from the street address, a couple from the city, and a couple
from the zip code (a sketch follows).
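A hedged sketch of this idea; the stride of 8 and the multiplier 37 are illustrative choices, not a prescribed rule:

#include <string>
using std::string;

// Hash a long key (e.g. a full postal address) by sampling only some
// of its characters instead of all of them.
unsigned int partialHash( const string & key, int tableSize )
{
    unsigned int hashVal = 0;

    for( size_t i = 0; i < key.size( ); i += 8 )   // every 8th character
        hashVal = 37 * hashVal + key[ i ];

    return hashVal % tableSize;
}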
Handling Collisions: Separate Chaining
Separate chaining approach: keep a linked list of all the
elements that hash to the same value.
Suppose that the keys are the first 10 perfect squares and
the hash function is hash(x) = x mod 10.
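Working this out: the squares 1, 4, 9, 16, 25, 36, 49, 64, 81, 100 hash to cells 1, 4, 9, 6, 5, 6, 9, 4, 1, 0 respectively, so cell 0 holds {100}, cell 1 holds {1, 81}, cell 4 holds {4, 64}, cell 5 holds {25}, cell 6 holds {16, 36}, cell 9 holds {9, 49}, and the other cells have empty lists.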
Operations using Hashing
Search in Hash Table: use the hash function to
determine which list to traverse. Then search the
appropriate list.
Insertion in Hash Table: check the appropriate list to
see if the element is already there (if duplicates are expected, an
extra counter data member is incremented instead). Otherwise,
insert it at the front of the list: this is convenient, and newly
inserted elements are likely to be accessed again soon (see the
sketch after this slide).
Deletion of an element: do the hashing, then delete from
the linked list.
Note: the hash tables in this chapter work only for
objects that provide a hash function and equality
operators (operator== and/or operator!=). What about Comparables?
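A minimal separate-chaining sketch of these operations for integer keys (a simplification with illustrative names; the textbook's version is templated and also tracks the load factor):

#include <algorithm>
#include <list>
#include <vector>

// Separate-chaining hash table for ints: one linked list per cell.
class IntChainedHashTable
{
  public:
    explicit IntChainedHashTable( int size = 101 ) : lists( size ) { }

    bool contains( int x ) const
    {
        const auto & whichList = lists[ myHash( x ) ];
        return std::find( whichList.begin( ), whichList.end( ), x ) != whichList.end( );
    }

    bool insert( int x )
    {
        auto & whichList = lists[ myHash( x ) ];
        if( std::find( whichList.begin( ), whichList.end( ), x ) != whichList.end( ) )
            return false;              // duplicate: not inserted
        whichList.push_front( x );     // insert at the front of the list
        return true;
    }

    bool remove( int x )
    {
        auto & whichList = lists[ myHash( x ) ];
        auto itr = std::find( whichList.begin( ), whichList.end( ), x );
        if( itr == whichList.end( ) )
            return false;
        whichList.erase( itr );
        return true;
    }

  private:
    std::vector<std::list<int>> lists;   // the array of linked lists

    // Simple hash; assumes non-negative keys for a sensible distribution.
    size_t myHash( int x ) const
        { return static_cast<size_t>( x ) % lists.size( ); }
};

Insertion is done at the front of the list, matching the convention above.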
Hash function implementation
Use of function object template (C++11)
template <typename Key>
class hash
{
  public:
    size_t operator() ( const Key & k ) const;
};
The type size_t is an unsigned integral type that represents the size
of an object; it is guaranteed to be able to store an array index.
On a 32-bit system size_t is typically 32 bits wide; on a 64-bit system, 64 bits.
Default implementation of the hash function template
for the standard type string:

template <>
class hash<string>
{
  public:
    size_t operator()( const string & key ) const
    {
        size_t hashVal = 0;

        for( char ch : key )
            hashVal = 37 * hashVal + ch;

        return hashVal;
    }
};
Alternatives to Linked Lists?
Any scheme could be used besides linked lists to
resolve the collisions.
A binary search tree or even another hash table
would work.
If the table is large and the hash function is good,
all the lists should be short, so basic separate
chaining does not attempt anything more
complicated.
Load factor of a hash table
Load factor λ of a hash table:
λ = number of elements in the hash table / table size.
In the previous example, λ = 1.0.
Usually, a threshold is set on λ to trigger the
rehashing: i.e. expanding the table and re-
calculating the hash codes of the already stored entries.
Time required to perform a search = constant time
to evaluate the hash function + time to traverse
the list.
Hash tables without linked lists: probing hash tables
Hashing with separate chaining has the disadvantage
that using linked lists can slow the algorithm down.
An alternative approach (to resolving collisions with
linked lists) is to try alternative cells until an empty
cell is found.
More formally, cells h0(x), h1(x), h2(x), . . . are tried in
succession, where
hi(x) = (hash(x) + f (i)) mod TableSize, with f(0) = 0.
f is the collision resolution strategy.
All the data go inside the table, so a bigger table is
needed in this approach.
Generally, the load factor λ should be kept below 0.5.
Linear Probing
Linear probing: f is a linear function of i, typically f(i) = i,
i.e. trying cells sequentially (with wraparound) in search of an
empty cell.
e.g. with hash(x) = x mod 10 and linear probing, insert 89, 18,
49, 58, 69
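Working it out: 89 goes to cell 9 and 18 to cell 8; 49 collides with 89 at cell 9 and wraps around to cell 0; 58 finds cells 8, 9 and 0 occupied and lands in cell 1; 69 finds cells 9, 0 and 1 occupied and lands in cell 2.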
Linear Probing (cont.)
As long as the table is big enough, a free cell can always
be found, but the time to do so can get quite large.
Worse, even if the table is relatively empty, blocks of
occupied cells start forming. Because of this effect, known as
primary clustering, any key that hashes into the cluster
will require several attempts to resolve the collision, and
it will then be added to the cluster.
It can be shown that the expected number of probes
using linear probing is roughly 1/2 (1 + 1/(1 − λ)²) for
insertions and unsuccessful searches, and
1/2 (1 + 1/(1 − λ)) for successful searches.
Quadratic probing
Quadratic probing: a collision resolution method that
eliminates the primary clustering problem of linear
probing.
Collision function is quadratic. A popular choice is f(i) = i².
Probing properties
For linear probing, it is a bad idea to let the hash table get
nearly full, because performance degrades.
For quadratic probing, the situation is even more drastic:
There is no guarantee of finding an empty cell once
the table gets more than half full, or
even before the table gets half full if the table size is not
prime.
This is because at most half of the table can be used as
alternative locations to resolve collisions.
Theorem 5.1: If quadratic probing is used, and the table size
is prime, then a new element can always be inserted if the
table is at least half empty.
Code for hash tables using probing strategies: see the textbook (a sketch is given below).
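As an illustrative sketch (not the textbook's exact code), here is a quadratic-probing table for non-negative ints; it uses the standard trick of computing f(i) = f(i−1) + 2i − 1 incrementally instead of evaluating i² directly, and assumes a prime table size with the load factor kept below 0.5:

#include <vector>

// Open-addressing table with quadratic probing; -1 marks an empty cell.
class IntQuadraticHashTable
{
  public:
    explicit IntQuadraticHashTable( int size = 101 ) : cells( size, -1 ) { }

    bool contains( int x ) const
        { return cells[ findPos( x ) ] == x; }

    bool insert( int x )
    {
        int pos = findPos( x );
        if( cells[ pos ] == x )
            return false;          // already present
        cells[ pos ] = x;          // (rehashing when lambda > 0.5 is omitted here)
        return true;
    }

  private:
    static constexpr int EMPTY = -1;
    std::vector<int> cells;

    // Quadratic probing: successive probes are h(x), h(x)+1, h(x)+4, h(x)+9, ...
    // computed incrementally via f(i) = f(i-1) + 2i - 1.
    int findPos( int x ) const
    {
        int offset = 1;
        int currentPos = x % static_cast<int>( cells.size( ) );

        while( cells[ currentPos ] != EMPTY && cells[ currentPos ] != x )
        {
            currentPos += offset;   // advance to the next probe
            offset += 2;
            if( currentPos >= static_cast<int>( cells.size( ) ) )
                currentPos -= static_cast<int>( cells.size( ) );
        }
        return currentPos;
    }
};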
Double Hashing
For double hashing, one popular choice is
f (i) = i · hash2(x).
This formula says that we apply a second hash function
to x and probe at a distance hash2(x), 2hash2(x), . . . ,
and so on.
A poor choice of hash2(x) would be disastrous.
For instance, the obvious choice hash2(x) = x mod 9
would not help if 99 were inserted into the input in the
previous examples.
Thus, the function must never evaluate to zero.
It is also important to make sure all cells can be probed.
A function such as hash2(x) = R − (x mod R), with R a prime
smaller than TableSize, will work well.
Below: the same example as before, with R = 7.
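Working it out with hash(x) = x mod 10 and hash2(x) = 7 − (x mod 7): 89 and 18 go to cells 9 and 8 as before; 49 collides at cell 9 and, since hash2(49) = 7, lands in cell (9 + 7) mod 10 = 6; 58 collides at cell 8 and, since hash2(58) = 5, goes to cell 3; 69 collides at cell 9 and, since hash2(69) = 1, goes to cell 0.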
Important reminder: the size of the table should be prime!
(Size 10 in the example was used only for the convenience of mod 10.)
Rehashing
If the hash table gets close to full, the running time for the
operations will start taking too long, and insertions might
fail (for open addressing with quadratic probing).
Solution: a hash table of bigger size (roughly twice as big) is used,
with a new hash function;
compute the new hash value for each element of the
original table and insert it in the new table.
The old hash table is subsequently deleted.
This operation is called Rehashing.
It should be done infrequently.
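A hedged sketch of the rehashing step for a separate-chaining table stored as a vector of lists (simplified: the real version would pick the next prime size and reuse the table's own hash function):

#include <list>
#include <utility>
#include <vector>

// Rehash: build a table roughly twice as large and re-insert every element,
// computing its hash with the NEW table size.
void rehash( std::vector<std::list<int>> & lists )
{
    std::vector<std::list<int>> oldLists = std::move( lists );

    // New table roughly twice as big (ideally the next prime after 2 * old size).
    lists = std::vector<std::list<int>>( 2 * oldLists.size( ) );

    for( const auto & chain : oldLists )
        for( int x : chain )
            lists[ static_cast<size_t>( x ) % lists.size( ) ].push_front( x );
}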
Rehashing example
Insert 13, 15, 24, 6 into a hash table of size 7,
with h(x) = x mod 7 and linear probing:

    0: 6   1: 15   2: (empty)   3: 24   4: (empty)   5: (empty)   6: 13

(6 hashes to cell 6, which is already occupied by 13, and wraps around to cell 0.)
If 23 is then inserted (linear probing puts it in cell 2), the
table becomes more than 70% full:

    0: 6   1: 15   2: 23   3: 24   6: 13

A new table is created; 17 is the next prime
number about twice as large as 7.
New hash function: h(x) = x mod 17.
Exercise: You can easily check the new hash table with
these data elements.
When to rehash?
Rehashing can be implemented in several ways with
quadratic probing
Rehash as soon as the table is half full.
The other extreme is to rehash only when an insertion
fails (even with probing).
A third, middle-of-the-road strategy is to rehash when
the table reaches a certain load factor.
Since performance does degrade as the load factor
increases, the third strategy, implemented with a good
cutoff, could be best.
Implementation of rehashing for quadratic probing:
see the textbook, which also gives rehashing for a separate-chaining hash table.