Unit – V
Searching, Sorting and Hashing
Dr.P.Ganeshkumar
Dept. of Information Technology
Anna University Regional Campus, Coimbatore
Hashing - Need
• To design a system for storing employee records keyed
using phone numbers
– Insert a phone number and corresponding information.
– Search a phone number and fetch the information.
– Delete a phone number and related information.
• The following data structures may be used to maintain
information about different phone numbers
– Array of phone numbers and records.
– Linked List of phone numbers and records.
– Balanced binary search tree with phone numbers as keys.
– Direct Access Table.
10/11/2019 Dr.P.Ganeshkumar 2
Hashing - Need
• For arrays and linked lists, we need to search in a linear fashion,
which can be costly in practice.
• With a balanced binary search tree, we get moderate search,
insert and delete times (O(log n)).
• Use a direct access table where we make a big array and use
phone numbers as indices into the array. Even though it seems best:
– Extra space required is huge
– An integer in a programming language may not be able to store all
the digits of a phone number
• Hashing can perform extremely well compared to above data
structures like Array, Linked List, Balanced BST in practice.
• Hashing is an improvement over Direct Access Table.
• The idea is to use hash function that converts a given phone
number or any other key to a smaller number and uses the small
number as index in a table called hash table.
Hash Function and Hash Table
• A function that converts a given big phone
number to a small practical integer value.
• The mapped integer value is used as an index in
hash table.
• A good hash function should have following
properties
– Efficiently computable.
– Should uniformly distribute the keys
• Hash Table
– An array that stores pointers to records corresponding
to a given phone number.
Hashing Data Structure
• Designed to use a special function called the Hash function which is used
to map a given value with a particular key for faster access of elements.
• The efficiency of mapping depends on the efficiency of the hash
function used. For example, let a hash function H(x) map a value x to
index x % 10 in an array.
• If the list of values is [11, 12, 13, 14, 15], they will be stored at
positions {1, 2, 3, 4, 5} in the array or hash table.
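As a sketch (in Python; the slides give no code), the H(x) = x % 10 mapping above looks like this:

```python
# Direct mapping H(x) = x % 10, as in the example above: each value
# lands at the index given by its last digit, so lookup is one array access.
values = [11, 12, 13, 14, 15]
table = [None] * 10
for x in values:
    table[x % 10] = x   # 11 -> slot 1, 12 -> slot 2, ..., 15 -> slot 5

print(table)
```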
Collision in Hashing
• Since a hash function gets us a small number for a key which is a big
integer or string, there is a possibility that two keys result in the same
value.
• The situation where a newly inserted key maps to an already occupied
slot in the hash table is called collision and must be handled using some
collision handling technique.
(Figures: hash table without collision and with collision)
How to handle Collision?
• There are two ways to handle collisions
– Separate Chaining
• The idea is to make each cell of hash table point to a linked
list of records that have same hash function value.
• Chaining is simple, but requires additional memory outside
the table.
– Open Addressing
• In open addressing, all elements are stored in the hash
table itself. Each table entry contains either a record or NIL.
• When searching for an element, we one by one examine
table slots until the desired element is found or it is clear
that the element is not in the table.
Separate Chaining
• The idea is to make each cell of hash table point to a linked list of records that have same
hash function value.
• Let us consider a simple hash function as “key mod 7” and sequence of keys as 50, 700,
76, 85, 92, 73, 101.
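A minimal separate-chaining table for this example, as a Python sketch (the class name and structure are illustrative, not from the slides):

```python
# Separate chaining with hash function "key mod 7": each slot holds a
# chain (here a Python list) of all keys that hash to it.
class ChainedHashTable:
    def __init__(self, size=7):
        self.size = size
        self.table = [[] for _ in range(size)]   # one chain per slot

    def insert(self, key):
        self.table[key % self.size].append(key)  # collisions extend the chain

    def search(self, key):
        return key in self.table[key % self.size]

t = ChainedHashTable()
for key in [50, 700, 76, 85, 92, 73, 101]:
    t.insert(key)
print(t.table)   # slot 1 chains 50, 85, 92; slot 3 chains 73, 101
```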
Separate Chaining
• Advantages
– Simple to implement.
– Hash table never fills up; we can always add more elements to
the chain.
– Less sensitive to the hash function or load factors.
– It is mostly used when it is unknown how many and how
frequently keys may be inserted or deleted.
• Disadvantages
– Cache performance of chaining is not good as keys are stored
using a linked list.
– Wastage of Space (Some Parts of hash table are never used)
– If the chain becomes long, search time can degrade to O(n) in the worst case.
– Uses extra space for links.
Open Addressing
• In Open Addressing, all elements are stored in the
hash table itself. So at any point, size of the table must
be greater than or equal to the total number of keys.
• Can be done in three ways
– Linear Probing
• We linearly probe for the next slot.
– Quadratic Probing
• We look for the i²-th slot in the i-th iteration.
– Double Hashing
• We use another hash function hash2(x) and look for the i*hash2(x) slot
in the i-th iteration.
Linear Probing
• Let hash(x) be the slot index computed using hash function and S be the table size.
– If slot hash(x) % S is full, then we try (hash(x) + 1) % S
– If (hash(x) + 1) % S is also full, then we try (hash(x) + 2) % S
– If (hash(x) + 2) % S is also full, then we try (hash(x) + 3) % S
• Let us consider a simple hash function as “key mod 7” and sequence of keys as 50, 700,
76, 85, 92, 73, 101.
• Advantages
• Easy to compute
• Best cache performance
• Disadvantages
• Suffers from clustering
• many consecutive
elements form groups
and it starts taking time
to find a free slot or
to search an element.
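The probing rule above can be sketched in Python (the slides give no code; the function name is illustrative):

```python
# Linear probing insertion with hash function "key mod size":
# if a slot is full, step forward one slot at a time, wrapping around.
def linear_probe_insert(table, key):
    size = len(table)
    i = key % size
    while table[i] is not None:      # slot occupied: try the next one
        i = (i + 1) % size
    table[i] = key

table = [None] * 7
for key in [50, 700, 76, 85, 92, 73, 101]:
    linear_probe_insert(table, key)
print(table)   # [700, 50, 85, 92, 73, 101, 76]
```

Note that 85, 92, 73 and 101 all collide and end up in consecutive slots, illustrating the clustering problem.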
Linear Probing
• Initial hash table is empty; table size = 17
• h_i(data) = (data + i) mod 17
Insert 7 at h0(7): (7 mod 17) = 7
Insert 36 at h0(36): (36 mod 17) = 2
Insert 24 at h1(24): (24 mod 17) = 7 is occupied, so 24 goes to slot (7 + 1) mod 17 = 8
Insert 75 at h2(75): (75 mod 17) = 7; slots 7 and 8 are occupied, so 75 goes to slot (7 + 2) mod 17 = 9
Quadratic Probing
• Let hash(x) be the slot index computed using hash function.
– If slot hash(x) % S is full, then we try (hash(x) + 1*1) % S
– If (hash(x) + 1*1) % S is also full, then we try (hash(x) + 2*2) % S
– If (hash(x) + 2*2) % S is also full, then we try (hash(x) + 3*3) % S
• Quadratic probing lies between linear probing and double hashing in
terms of cache performance and clustering.
Initial hash table
Insert 5 at h0(5) (5 mod 17) = 5
Quadratic Probing
Insert 56 at h1(56): (56 mod 17) = 5 is occupied;
((56 + 1*1) mod 17) = 6
Insert 73 at h2(73): (73 mod 17) = 5 and slot 6 are occupied;
((73 + 2*2) mod 17) = 9
Insert 124 at h3(124): (124 mod 17) = 5, then slots 6 and 9 are occupied;
((124 + 3*3) mod 17) = 14
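The same worked example as a Python sketch (function name illustrative):

```python
# Quadratic probing insertion: probe hash(x), then +1, +4, +9, ...
# all taken mod the table size.
def quadratic_probe_insert(table, key):
    size = len(table)
    for i in range(size):
        slot = (key + i * i) % size
        if table[slot] is None:
            table[slot] = key
            return slot
    raise RuntimeError("no free slot found along the probe sequence")

table = [None] * 17
slots = [quadratic_probe_insert(table, k) for k in [5, 56, 73, 124]]
print(slots)   # [5, 6, 9, 14]
```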
Random Probing
• Let hash(x) be the slot index computed using hash
function.
– If slot hash(x) % S is full, then we try (hash(x) + RandomGen()) % S
– If that slot is also full, we try again with the next random number,
and so on
• Use Randomize(X) to ‘seed’ the random number
generator using X
• Each call of RandomGen() will return the next
random number in the random sequence for seed X
Double Hashing
• Let hash(x) be the slot index computed using hash
function.
– If slot hash(x) % S is full, then we try (hash(x) +
1*hash2(x)) % S
– If (hash(x) + 1*hash2(x)) % S is also full, then we try
(hash(x) + 2*hash2(x)) % S
– If (hash(x) + 2*hash2(x)) % S is also full, then we try
(hash(x) + 3*hash2(x)) % S
• Double hashing has poor cache performance but
no clustering.
• Double hashing requires more computation time
as two hash functions need to be computed.
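A Python sketch of double hashing, assuming a common textbook secondary hash hash2(x) = 5 - (x mod 5) (an assumption; the slides do not fix hash2):

```python
# Double hashing: the start slot comes from "key mod size" and the
# probe step comes from a second hash, hash2(x) = 5 - (x mod 5),
# which is always in 1..5 and never 0.
def double_hash_insert(table, key):
    size = len(table)
    step = 5 - (key % 5)                 # assumed secondary hash
    for i in range(size):
        slot = (key % size + i * step) % size
        if table[slot] is None:
            table[slot] = key
            return slot
    raise RuntimeError("no free slot found")

table = [None] * 7    # prime size, so every step value cycles all slots
for key in [50, 700, 76, 85, 92, 73, 101]:
    double_hash_insert(table, key)
print(table)
```

Because the table size is prime, every step in 1..5 is coprime to it, so the probe sequence visits all slots.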
Separate Chaining Vs Open Addressing
1. Chaining is simpler to implement; open addressing requires more computation.
2. In chaining, the hash table never fills up; we can always add more elements to a chain. In open addressing, the table may become full.
3. Chaining is less sensitive to the hash function or load factor; open addressing requires extra care to avoid clustering and high load factors.
4. Chaining is mostly used when it is unknown how many and how frequently keys may be inserted or deleted; open addressing is used when the frequency and number of keys is known.
5. Cache performance of chaining is not good, as keys are stored in linked lists; open addressing provides better cache performance, as everything is stored in the same table.
6. Chaining wastes space (some parts of the hash table are never used); in open addressing, a slot can be used even if no input maps to it.
7. Chaining uses extra space for links; open addressing has no links.
Extendible Hashing
• A dynamic hashing method wherein directories and buckets
are used to hash data.
• An aggressively flexible method in which the hash
function also experiences dynamic changes.
• Main features
– Directories
• The directories store addresses of the buckets in pointers
• An id is assigned to each directory which may change each
time when Directory Expansion takes place
– Buckets
• The buckets are used to hash the actual data
Frequently Used Terms
• Directories:
– These containers store pointers to buckets. Each directory is given a unique id which may change each
time when expansion takes place. The hash function returns this directory id which is used to navigate
to the appropriate bucket. Number of Directories = 2^Global Depth.
• Buckets:
– They store the hashed keys. Directories point to buckets. A bucket may have more than one
pointer to it if its local depth is less than the global depth.
• Global Depth:
– It is associated with the Directories. They denote the number of bits which are used by the hash
function to categorize the keys. Global Depth = Number of bits in directory id.
• Local Depth:
– It is the same as the global depth, except that local depth is associated with the buckets
and not the directories. The local depth, in accordance with the global depth, is used to decide
the action to be performed when an overflow occurs. Local depth is always less than or equal to
the global depth.
• Bucket Splitting:
– When the number of elements in a bucket exceeds a particular size, then the bucket is split into two
parts.
• Directory Expansion:
– Directory Expansion Takes place when a bucket overflows. Directory Expansion is performed when
the local depth of the overflowing bucket is equal to the global depth.
Basic Working Principle
• Step 1 – Analyze Data Elements: Data elements may
exist in various forms, e.g., integer, string, float.
Here, let us consider data elements of type
integer, e.g., 49.
• Step 2 – Convert into binary format: Convert the
data element in Binary form. For string elements,
consider the ASCII equivalent integer of the starting
character and then convert the integer into binary
form. Since we have 49 as our data element, its
binary form is 110001.
• Step 3 – Check Global Depth of the
directory. Suppose the global depth of the Hash-
directory is 3.
Basic Working Principle
• Step 4 – Identify the Directory: Consider the
‘Global-Depth’ number of LSBs in the binary
number and match it to the directory id.
Eg. The binary obtained is: 110001 and the global-
depth is 3. So, the hash function will return 3 LSBs
of 110001 viz. 001.
• Step 5 – Navigation: Now, navigate to the bucket
pointed by the directory with directory-id 001.
• Step 6 – Insertion and Overflow Check: Insert the
element and check if the bucket overflows. If an
overflow is encountered, go to step 7 followed
by Step 8, otherwise, go to step 9.
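The LSB-based directory lookup in Steps 2-5 can be sketched in Python (the slides give no code; the function name is illustrative):

```python
# The directory id is just the 'global depth' lowest bits of the key's
# binary form: mask them off and format them as a bit string.
def directory_id(key, global_depth):
    bits = key & ((1 << global_depth) - 1)   # keep global_depth LSBs
    return format(bits, "0{}b".format(global_depth))

print(format(49, "b"))        # 110001, the binary form of 49
print(directory_id(49, 3))    # 001, the 3 LSBs
```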
Basic Working Principle
• Step 7 – Tackling the Overflow Condition during Data Insertion: While
inserting data in the buckets, the bucket may overflow. In such cases,
we need an appropriate procedure to avoid mishandling of data. First,
check whether the local depth is less than or equal to the global
depth, then choose one of the cases below.
– Case1: If the local depth of the overflowing Bucket is equal to the global
depth, then Directory Expansion, as well as Bucket Split, needs to be
performed. Then increment the global depth and the local depth value by 1.
And, assign appropriate pointers. Directory expansion will double the
number of directories present in the hash structure.
– Case2: In case the local depth is less than the global depth, then only Bucket
Split takes place. Then increment only the local depth value by 1. And, assign
appropriate pointers.
• Step 8 – Rehashing of Split Bucket Elements: The Elements present in
the overflowing bucket that is split are rehashed w.r.t the new global
depth of the directory.
• Step 9 – The element is successfully hashed.
Example based on Extendible Hashing
• Now, let us consider a prominent example of hashing the following
elements: 16,4,6,22,24,10,31,7,9,20,26.
• Bucket Size: 3 (Assume)
• Hash Function: Suppose the global depth is X. Then the Hash
Function returns X LSBs.
• Solution: First, calculate the binary forms of each of the given
numbers.
16- 10000
4- 00100
6- 00110
22- 10110
24- 11000
10- 01010
31- 11111
7- 00111
9- 01001
20- 10100
26- 11010
Example based on Extendible Hashing
• Initially, the global depth and local depth are always 1. Thus, the hashing
frame looks like this:
• Inserting 16: The binary format of 16 is 10000 and global-depth is 1. The
hash function returns 1 LSB of 10000 which is 0. Hence, 16 is mapped to
the directory with id=0.
• Inserting 4 and 6: Both 4 (100) and 6 (110) have 0 in their LSB. Hence, they
are hashed as follows:
Example based on Extendible Hashing
• Inserting 22: The binary form of 22 is 10110. Its LSB is 0. The bucket pointed by directory 0
is already full. Hence, Over Flow occurs.
• As directed by Step 7-Case 1, Since Local Depth = Global Depth, the bucket splits and directory
expansion takes place. Also, rehashing of numbers present in the overflowing bucket takes
place after the split. And, since the global depth is incremented by 1, now, the global depth is 2.
Hence, 16,4,6,22 are now rehashed w.r.t 2 LSBs.[ 16(10000),4(100),6(110),22(10110) ]
Example based on Extendible Hashing
• Inserting 24 and 10: 24(11000) and 10 (1010) can be hashed based on directories with id
00 and 10. Here, we encounter no overflow condition.
• Inserting 31,7,9: All of these elements[ 31(11111), 7(111), 9(1001) ] have either 01 or 11 in
their LSBs. Hence, they are mapped on the bucket pointed out by 01 and 11. We do not
encounter any overflow condition here.
Example based on Extendible Hashing
• Inserting 20: Insertion of data element 20 (10100) will
again cause the overflow problem.
Example based on Extendible Hashing
• 20 is inserted in bucket pointed out by 00. As directed by Step 7-Case 1, since
the local depth of the bucket = global-depth, directory expansion (doubling)
takes place along with bucket splitting. Elements present in overflowing bucket
are rehashed with the new global depth. Now, the new Hash table looks like this:
Example based on Extendible Hashing
• Inserting 26: Global depth is 3. Hence, 3 LSBs of 26(11010) are
considered. Therefore 26 best fits in the bucket pointed out by directory
010.
Example based on Extendible Hashing
• The bucket overflows, and, as directed by Step 7-Case 2, since the local depth
of bucket < Global depth (2<3), directories are not doubled but, only the
bucket is split and elements are rehashed. Finally, the output of hashing the
given list of numbers is obtained.
Key Observations
• A bucket will have more than one pointer pointing to it if
its local depth is less than the global depth.
• When an overflow occurs in a bucket, all the entries
in the bucket are rehashed with a new local depth.
• If Local Depth of the overflowing bucket is equal to the
global depth, only then the directories are doubled and the
global depth is incremented by 1.
• The size of a bucket cannot be changed after the data
insertion process begins.
Advantages
• Data retrieval is less expensive (in terms of
computing).
• No problem of Data-loss since the storage
capacity increases dynamically.
• With dynamic changes in hashing function,
associated old values are rehashed w.r.t the new
hash function.
Limitations Of Extendible Hashing
• The directory size may increase significantly if several
records hash to the same directory while the record
distribution is non-uniform.
• Size of every bucket is fixed.
• Memory is wasted in pointers when the global depth
and local depth difference becomes drastic.
• This method is complicated to code.
Data Structures used for
implementation of Extendible Hashing
• B+ Trees
• Array
• Linked List
Sorting
• Technique used to rearrange the elements of a given list
according to a comparison operator on the
elements.
• The comparison operator is used to decide the new
order of elements in the respective data structure.
– Bubble Sort
– Selection Sort
– Insertion Sort
– Shell Sort
– Radix Sort
Bubble Sort
• Bubble Sort is the simplest sorting algorithm; it works by repeatedly swapping
adjacent elements if they are in the wrong order.
• Example: ( 5 1 4 2 8 )
• First Pass:
– ( 5 1 4 2 8 ) –> ( 1 5 4 2 8 ), Here, algorithm compares the first two elements, and swaps since 5 > 1.
– ( 1 5 4 2 8 ) –> ( 1 4 5 2 8 ), Swap since 5 > 4
– ( 1 4 5 2 8 ) –> ( 1 4 2 5 8 ), Swap since 5 > 2
– ( 1 4 2 5 8 ) –> ( 1 4 2 5 8 ),
– Now, since these elements are already in order (8 > 5), algorithm does not swap them.
• Second Pass:
– ( 1 4 2 5 8 ) –> ( 1 4 2 5 8 )
– ( 1 4 2 5 8 ) –> ( 1 2 4 5 8 ), Swap since 4 > 2
– ( 1 2 4 5 8 ) –> ( 1 2 4 5 8 )
– ( 1 2 4 5 8 ) –> ( 1 2 4 5 8 )
– Now, the array is already sorted, but our algorithm does not know if it is completed. The algorithm needs
one whole pass without any swap to know it is sorted.
• Third Pass:
– ( 1 2 4 5 8 ) –> ( 1 2 4 5 8 )
– ( 1 2 4 5 8 ) –> ( 1 2 4 5 8 )
– ( 1 2 4 5 8 ) –> ( 1 2 4 5 8 )
– ( 1 2 4 5 8 ) –> ( 1 2 4 5 8 )
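The passes above can be sketched in Python, including the early-exit check described in the second pass (the slides give no code):

```python
# Bubble sort: repeatedly swap adjacent out-of-order pairs; a full
# pass with no swaps means the list is already sorted.
def bubble_sort(arr):
    n = len(arr)
    for i in range(n - 1):
        swapped = False
        for j in range(n - 1 - i):
            if arr[j] > arr[j + 1]:          # adjacent pair out of order
                arr[j], arr[j + 1] = arr[j + 1], arr[j]
                swapped = True
        if not swapped:                      # pass with no swap: done
            break
    return arr

print(bubble_sort([5, 1, 4, 2, 8]))   # [1, 2, 4, 5, 8]
```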
Selection Sort
• The selection sort algorithm sorts an array by repeatedly finding
the minimum element (considering ascending order) from
unsorted part and putting it at the beginning.
• The algorithm maintains two subarrays in a given array.
– 1)The subarray which is already sorted.
– 2) Remaining subarray which is unsorted.
• arr[] = 64 25 12 22 11
• Find the minimum element in arr[0...4] and place it at beginning
11 25 12 22 64
• Find the minimum element in arr[1...4] and place it at beginning
of arr[1...4] 11 12 25 22 64
• Find the minimum element in arr[2...4] and place it at beginning
of arr[2...4] 11 12 22 25 64
• Find the minimum element in arr[3...4] and place it at beginning
of arr[3...4] 11 12 22 25 64
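The trace above corresponds to this Python sketch (the slides give no code):

```python
# Selection sort: grow the sorted prefix one element per pass by
# swapping in the minimum of the remaining unsorted part.
def selection_sort(arr):
    n = len(arr)
    for i in range(n - 1):
        min_idx = i
        for j in range(i + 1, n):        # scan the unsorted part
            if arr[j] < arr[min_idx]:
                min_idx = j
        arr[i], arr[min_idx] = arr[min_idx], arr[i]
    return arr

print(selection_sort([64, 25, 12, 22, 11]))   # [11, 12, 22, 25, 64]
```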
Insertion Sort
• Insertion sort is a simple sorting algorithm that
works the way we sort playing cards in our hands.
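The playing-cards idea can be sketched in Python (the slides give no code):

```python
# Insertion sort: take each new "card" and shift it left past every
# larger card, like sorting a hand of playing cards.
def insertion_sort(arr):
    for i in range(1, len(arr)):
        key = arr[i]
        j = i - 1
        while j >= 0 and arr[j] > key:   # shift larger elements right
            arr[j + 1] = arr[j]
            j -= 1
        arr[j + 1] = key                 # drop the card into its slot
    return arr

print(insertion_sort([18, 32, 12, 5, 38, 33, 16, 2]))
```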
Shellsort Examples
Sort: 18 32 12 5 38 33 16 2
8 numbers to be sorted; Shell’s increment will be floor(n/2): floor(8/2) = 4
increment 4: 1 2 3 4 (visualize underlining)
18 32 12 5 38 33 16 2
Step 1) Only look at 18 and 38 and sort in order ;
18 and 38 stays at its current position because they are in order.
Step 2) Only look at 32 and 33 and sort in order ;
32 and 33 stays at its current position because they are in order.
Shellsort Examples
Sort: 18 32 12 5 38 33 16 2
8 numbers to be sorted; Shell’s increment will be floor(n/2): floor(8/2) = 4
increment 4: 1 2 3 4 (visualize underlining)
18 32 12 5 38 33 16 2
Step 3) Only look at 12 and 16 and sort in order ;
12 and 16 stays at its current position because they are in order.
Step 4) Only look at 5 and 2 and sort in order ;
2 and 5 need to be switched to be in order.
Shellsort Examples (con’t)
Sort: 18 32 12 5 38 33 16 2
Resulting numbers after increment 4 pass:
18 32 12 2 38 33 16 5
Next increment: floor(4/2) = 2
increment 2: 1 2
18 32 12 2 38 33 16 5
Step 1) Look at 18, 12, 38, 16 and sort them in their appropriate locations:
12 32 16 2 18 33 38 5
Step 2) Look at 32, 2, 33, 5 and sort them in their appropriate locations:
12 2 16 5 18 32 38 33
Shellsort Examples (con’t)
Sort: 18 32 12 5 38 33 16 2
Final increment: floor(2/2) = 1
increment 1: 1
12 2 16 5 18 32 38 33
2 5 12 16 18 32 33 38
The last increment or phase of Shellsort is basically an Insertion
Sort algorithm.
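The 4 -> 2 -> 1 passes worked through above can be sketched in Python with Shell's increments floor(n/2), floor(n/4), ..., 1 (the slides give no code):

```python
# Shellsort: for each gap, run an insertion sort over every
# gap-separated subsequence; the final gap-1 pass is plain insertion sort.
def shell_sort(arr):
    gap = len(arr) // 2
    while gap > 0:
        for i in range(gap, len(arr)):
            key = arr[i]
            j = i
            while j >= gap and arr[j - gap] > key:   # shift within subsequence
                arr[j] = arr[j - gap]
                j -= gap
            arr[j] = key
        gap //= 2
    return arr

print(shell_sort([18, 32, 12, 5, 38, 33, 16, 2]))
```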
RadixSort
• Input list:
126, 328, 636, 341, 416, 131, 328
• BinSort on lower digit:
341, 131, 126, 636, 416, 328, 328
• BinSort result on next-higher digit:
416, 126, 328, 328, 131, 636, 341
• BinSort that result on highest digit:
126, 131, 328, 328, 341, 416, 636
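The digit-by-digit passes above can be sketched in Python (the slides give no code; the stable bin sort is implemented with one bucket list per digit):

```python
# Radix sort: repeat a stable bin sort (one bucket per decimal digit)
# from the lowest digit to the highest.
def radix_sort(arr, digits=3):
    for d in range(digits):                       # 1s, 10s, 100s, ...
        bins = [[] for _ in range(10)]
        for x in arr:
            bins[(x // 10 ** d) % 10].append(x)   # stable: keeps order within a bin
        arr = [x for b in bins for x in b]
    return arr

print(radix_sort([126, 328, 636, 341, 416, 131, 328]))
```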
BinSort example
• K=5. list=(5,1,3,4,3,2,1,1,5,4,5)
Bins in array:
key = 1: 1,1,1
key = 2: 2
key = 3: 3,3
key = 4: 4,4
key = 5: 5,5,5
Sorted list: 1,1,1,2,3,3,4,4,5,5,5
Linear Search
• Given an array arr[] of n elements
– Start from the leftmost element of arr[] and one by
one compare x with each element of arr[]
– If x matches with an element, return the index.
– If x doesn’t match with any of elements, return -1.
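The steps above can be sketched in Python (the slides give no code):

```python
# Linear search: scan left to right; return the index on a match,
# or -1 if x is not in the array.
def linear_search(arr, x):
    for i, value in enumerate(arr):
        if value == x:
            return i
    return -1

print(linear_search([10, 20, 80, 30, 60], 30))   # 3
print(linear_search([10, 20, 80, 30, 60], 99))   # -1
```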
Binary Search
• Given a sorted array arr[] of n elements
– Search a sorted array by repeatedly dividing the
search interval in half.
– Begin with an interval covering the whole array.
• If the value of the search key is less than the item in
the middle of the interval, narrow the interval to the
lower half.
• Otherwise narrow it to the upper half. Repeatedly
check until the value is found or the interval is
empty.
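The interval-halving procedure above can be sketched in Python (the slides give no code):

```python
# Binary search on a sorted array: compare x with the middle element
# and discard the half that cannot contain it.
def binary_search(arr, x):
    lo, hi = 0, len(arr) - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        if arr[mid] == x:
            return mid
        elif arr[mid] < x:
            lo = mid + 1        # x can only be in the upper half
        else:
            hi = mid - 1        # x can only be in the lower half
    return -1                   # interval empty: not found

print(binary_search([2, 5, 12, 16, 18, 32, 33, 38], 16))   # 3
```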
(Figure: binary search example)
Linear Search vs Binary Search
• A linear search scans one item at a time, without jumping to any item
– The worst case complexity is O(n), so it is sometimes known as an O(n) search
– The time taken to search increases as the number of elements
increases.
• A binary search, however, cuts the search space in half at each step
using the middle of a sorted list.
– The middle element is examined to check whether it is greater than or less than the
value to be searched.
– Accordingly, the search continues in the appropriate half of the given list
• Important Differences
– Input data needs to be sorted in Binary Search and not in Linear Search
– Linear search does sequential access whereas binary search accesses data
randomly.
– Time complexity of linear search -O(n) , Binary search has time complexity
O(log n).
– Linear search performs equality comparisons and Binary search performs
ordering comparisons
10/11/2019 Dr.P.Ganeshkumar 50