BLOOM FILTER
Dr. CHANDRALEKHA M
ASSISTANT PROFESSOR
DEPT. OF COMPUTER SCIENCE AND ENGINEERING
AMRITA SCHOOL OF COMPUTING, CHENNAI CAMPUS
Mob. No: +91 9442414745
1. Suppose you are creating an account on Gmail and want a cool
username. You enter it and get the message “Username is
already taken”.
2. You add your birth date to the username; still no luck.
3. Now you add your university roll number as well, and still get
“Username is already taken”.
4. It’s really frustrating, isn’t it?
5. But have you ever thought about how quickly Gmail checks the
availability of a username by searching the millions of usernames
registered with it?
There are many ways to do this job:
Linear search : Bad idea! Every registered username may have to be
compared with the one entered.
Binary search : Store all usernames alphabetically and compare the entered
username with the middle one in the list. If it matches, the username is
taken. Otherwise, work out whether the entered username comes before or
after the middle one. If it comes after, discard all the usernames up to
and including the middle one and search the remaining half; if it comes
before, discard the middle one and everything after it. Repeat this
process until you get a match or the search ends with no match. This
technique is better and promising, but it still requires multiple steps.
But there must be something better!
A Bloom filter is a data structure that can do this job.
LINEAR SEARCH
Let the elements of the array be: [figure: example array]
• Let the element to be searched for be K = 41.
• Start from the first element and compare K with each element of
the array in turn.
• The value of K, i.e., 41, does not match the first element of the
array.
• So, move to the next element, and follow the same process until the
required element is found.
Now the element to be searched for is found, so the algorithm returns the index of the matching element.
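The steps above can be sketched in Python as follows. This is a minimal illustration, not the slides' own code: the function name `linear_search` and the array `sample` are assumptions chosen for the example, since the original array figure is not reproduced here.

```python
def linear_search(values, target):
    """Return the index of the first element equal to target, or -1 if absent."""
    for index, value in enumerate(values):
        # Compare the target with each element in turn, starting from the first.
        if value == target:
            return index
    return -1


# Hypothetical array used only for illustration.
sample = [70, 40, 30, 11, 57, 41, 25, 14, 52]

print(linear_search(sample, 41))  # 5
print(linear_search(sample, 99))  # -1, because 99 is not in the list
```

In the worst case every element is inspected, which is why this approach scales poorly to millions of usernames.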
BINARY SEARCH
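Let the elements of the array be: [figure: example sorted array]
Let the element to search for be K = 56.
We use the formula below (with integer division) to calculate the mid of the array:
mid = (beg + end)/2
So, in the given array:
beg = 0
end = 8
mid = (0 + 8)/2 = 4. So, 4 is the mid of the array.
Comparing K with the element at mid decides which half to keep, and the same
calculation is then repeated on that half.
Now the element to search for is found, so the algorithm returns the index of the
matching element.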
Binary search is implemented using the following steps:
Step 1 - Read the search element from the user.
Step 2 - Find the middle element in the sorted list.
Step 3 - Compare the search element with the middle element in the sorted list.
Step 4 - If they match, then display "Given element is found!!!" and
terminate the function.
Step 5 - If they do not match, then check whether the search element is smaller
or larger than the middle element.
Step 6 - If the search element is smaller than the middle element, repeat steps 2, 3, 4
and 5 for the left sublist of the middle element.
Step 7 - If the search element is larger than the middle element, repeat steps 2, 3, 4 and
5 for the right sublist of the middle element.
Step 8 - Repeat the same process until we find the search element in the list or
until the sublist contains only one element.
Step 9 - If that element also doesn't match the search element, then
display "Element is not found in the list!!!" and terminate the function.
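The steps above can be sketched in Python as follows. The function name `binary_search` and the nine-element array `sample` are assumptions made for illustration (chosen so that beg = 0 and end = 8, as in the worked values above); the original array figure is not reproduced here.

```python
def binary_search(sorted_values, target):
    """Return the index of target in sorted_values, or -1 if it is absent."""
    beg, end = 0, len(sorted_values) - 1
    while beg <= end:
        mid = (beg + end) // 2  # integer division, as in mid = (beg + end)/2
        if sorted_values[mid] == target:
            return mid
        if target < sorted_values[mid]:
            end = mid - 1  # keep the left sublist
        else:
            beg = mid + 1  # keep the right sublist
    return -1


# Hypothetical sorted array used only for illustration.
sample = [10, 12, 24, 29, 39, 40, 51, 56, 69]

print(binary_search(sample, 56))  # 7
print(binary_search(sample, 11))  # -1, because 11 is not in the list
```

For this sample the search visits mid = 4, then mid = 6, then mid = 7, so three comparisons are enough for nine elements.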
What is Bloom Filter?
• A Bloom filter is a space-efficient probabilistic data structure
(a data structure that provides approximate answers to queries
about a large dataset, rather than exact answers) that tells
whether an element may be in a set or is definitely not in it.
• If we look up an item in the Bloom filter, we can get two
possible results.
✓The item is not present in the set: true negative.
✓The item might be present in the set: can be either a false
positive or a true positive.
• For example, checking the availability of a username is a set membership
problem, where the set is the list of all registered usernames.
• The price we pay for efficiency is that it is probabilistic in nature, which
means there might be some false positive results.
• A false positive means the filter might say that a given username is already
taken when it actually is not.
Properties of Bloom Filters
• Unlike a standard hash table, a Bloom filter of a fixed size can
represent a set with an arbitrarily large number of elements.
• Adding an element never fails. However, the false positive rate
increases steadily as elements are added until all bits in the filter are
set to 1, at which point all queries yield a positive result.
• Bloom filters never generate a false negative result, i.e., they never tell
you that a username doesn’t exist when it actually exists.
• Deleting elements from the filter is not possible because, if we delete a
single element by clearing the bits at the indices generated by the k hash
functions, we might also delete other elements that share those bits.
• Example: if we delete “geeks” (in the example given below) by clearing
the bits at 1, 4 and 7, we might end up deleting “nerd” as well, because the
bit at index 4 becomes 0 and the Bloom filter claims that “nerd” is not present.
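The deletion problem can be replayed with a small Python sketch. It uses the illustrative bit positions for “geeks” and “nerd” from the Working of Bloom Filter example (m = 10, k = 3). The helper names and the `naive_delete` operation are hypothetical; a real Bloom filter does not offer deletion.

```python
# Illustrative bit positions copied from the slides' example (m = 10, k = 3).
SLIDE_INDICES = {"geeks": [1, 4, 7], "nerd": [3, 5, 4]}


def add(bits, item):
    """Set the bits for an item."""
    for i in SLIDE_INDICES[item]:
        bits[i] = 1


def might_contain(bits, item):
    """Return True only if every bit for the item is set."""
    return all(bits[i] == 1 for i in SLIDE_INDICES[item])


def naive_delete(bits, item):
    """Hypothetical delete that clears the item's bits (this is the mistake)."""
    for i in SLIDE_INDICES[item]:
        bits[i] = 0


bits = [0] * 10
add(bits, "geeks")
add(bits, "nerd")
before = might_contain(bits, "nerd")
naive_delete(bits, "geeks")          # clears bits 1, 4 and 7
after = might_contain(bits, "nerd")  # bit 4 is shared, so "nerd" is lost too
print(before, after)  # True False
```

The second lookup is a false negative, which is exactly what a Bloom filter must never produce.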
Working of Bloom Filter
An empty Bloom filter is a bit array of m bits, all set to zero.
• We need k hash functions to calculate the hashes for a
given input.
• When we want to add an item x to the filter, the bits at the k indices h1(x),
h2(x),… hk(x) are set to 1, where the indices are calculated using the hash
functions.
Example – Suppose we want to enter “geeks” in the filter, we are using
3 hash functions and a bit array of length 10, all set to 0 initially. Firstly
we’ll calculate the hashes as follows:
h1(“geeks”) % 10 = 1
h2(“geeks”) % 10 = 4
h3(“geeks”) % 10 = 7
Note: These outputs are random for explanation only.
Now we will set the bits at indices 1, 4 and 7 to 1
Again, we want to enter “nerd”, similarly, we’ll calculate hashes
h1(“nerd”) % 10 = 3
h2(“nerd”) % 10 = 5
h3(“nerd”) % 10 = 4
Set the bits at indices 3, 5 and 4 to 1
• Now suppose we want to check whether “geeks” is present in the filter. We
follow the same process, but this time we read the bits instead of setting them.
• We calculate the respective hashes using h1, h2 and h3 and check whether all
of these indices are set to 1 in the bit array.
• If all the bits are set, then we can say that “geeks” is probably present.
• If any of the bits at these indices is 0, then “geeks” is definitely not
present.
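The add and check procedure above can be sketched as a small Python class. This is a minimal illustration under stated assumptions, not a production design: the class name `BloomFilter`, the method names, and the choice to simulate the k hash functions by salting SHA-256 with the function number are all assumptions made for the example. The resulting indices will therefore differ from the illustrative values 1, 4, 7 and 3, 5, 4 used in the slides.

```python
import hashlib


class BloomFilter:
    """A bit array of m bits probed by k hash functions."""

    def __init__(self, m, k):
        self.m = m
        self.k = k
        self.bits = [0] * m  # an empty filter has every bit set to 0

    def _indices(self, item):
        # Assumption: hash function number i is SHA-256 of "i:item",
        # reduced modulo m so the index stays inside the bit array.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def insert(self, item):
        """Set the k bits for the item."""
        for index in self._indices(item):
            self.bits[index] = 1

    def lookup(self, item):
        """True means 'probably present'; False means 'definitely not present'."""
        return all(self.bits[index] == 1 for index in self._indices(item))


bf = BloomFilter(m=1000, k=3)
bf.insert("geeks")
bf.insert("nerd")
print(bf.lookup("geeks"))  # True: inserted items are always reported as present
print(bf.lookup("nerd"))   # True
```

A lookup for an item that was never inserted usually returns False, but it can return True when other items happen to have set all of its bits.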
False Positive in Bloom Filters
The question is why we said “probably present”. Why this uncertainty?
Let’s understand this with an example.
Suppose we want to check whether “cat” is present or not.
We’ll calculate the hashes using h1, h2 and h3:
h1(“cat”) % 10 = 1
h2(“cat”) % 10 = 3
h3(“cat”) % 10 = 7
• If we check the bit array, the bits at these indices are set to 1, but we know
that “cat” was never added to the filter.
• The bits at indices 1 and 7 were set when we added “geeks”, and bit 3 was set
when we added “nerd”.
• So, because the bits at the calculated indices were already set by other
items, the Bloom filter erroneously claims that “cat” is present,
generating a false positive result.
• Depending on the application, this could be a huge downside or relatively
okay.
• We can control the probability of getting a false positive by
controlling the size of the Bloom filter.
• More space means fewer false positives.
• If we want to decrease the probability of a false positive result, we have to
use a larger bit array and a suitable number of hash functions; beyond a
certain point, extra hash functions fill the array faster and make false
positives more likely.
• More hash functions also add latency to inserting an item and to checking
membership.
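The trade-off between bit-array size, number of hash functions and false positives can be estimated numerically. The sketch below uses the standard textbook approximation p ≈ (1 − e^(−kn/m))^k, which is not stated in the slides and assumes independent, uniformly distributed hash functions; n is the number of inserted items, and the function names are chosen for illustration.

```python
import math


def false_positive_rate(m, k, n):
    """Approximate false positive probability for m bits, k hashes, n items."""
    return (1 - math.exp(-k * n / m)) ** k


def optimal_k(m, n):
    """Number of hash functions that minimises the approximation: (m/n) * ln 2."""
    return max(1, round((m / n) * math.log(2)))


# The slides' small example: m = 10 bits, k = 3 hash functions, n = 2 items.
print(false_positive_rate(10, 3, 2))   # roughly 0.09
# A larger bit array lowers the rate for the same k and n.
print(false_positive_rate(100, 3, 2))
# Too many hash functions raise the rate again for a small array.
print(false_positive_rate(10, 8, 2))
print(optimal_k(10, 2))                # 3
```

Under this approximation, three hash functions happen to be the best choice for a 10-bit filter holding two items.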
Operations that a Bloom Filter supports
insert(x) : To insert an element into the Bloom filter.
lookup(x) : To check whether an element is already present in the Bloom
filter, with some probability of a false positive.
NOTE : We cannot delete an element from a Bloom filter.
Example of Bloom Filter
Suppose that the size of our Bloom filter is m = 10.
Inserting an item into the Bloom filter
• For example, we want to add the word “coding”.
• After passing it through three hash functions, we get the following
results.
h1(“coding”) = 125
h2(“coding”) = 67
h3(“coding”) = 19
• We take each of these values modulo 10 so that the index is
within the bounds of the Bloom filter.
• Therefore, the bits at indexes 125%10 = 5, 67%10 = 7 and 19%10 = 9 have to
be set to 1.
Testing membership of an item in Bloom filter
• If we want to test the membership of an element, we need to pass it
through same hash functions.
• If bits are already set for all these indexes, then this element might
exist in the set.
• However, even if one index is not set, we are sure that this element is
not present in the set.
• Let’s say we want to check the membership of “cat” in our set.
• Furthermore, we have already added two elements, “coding” and
“music”, to our set.
• Coding has the hash output {125, 67, 19} from the three hash
functions, and as discussed above, the indexes {5, 7, 9} are set to 1.
• Music has the hash output {290, 145, 2} and the indexes {0, 2, 5} are
set to 1.
• We pass “cat” through the same hash functions and get the following
results.
h1(“cat”) = 233
h2(“cat”) = 155
h3(“cat”) = 9
• So, we check if the indexes {3, 5, 9} are all set to 1.
• As we can see, even though indexes 5 and 9 are set to 1, 3 is not.
• Thus, we can conclude with 100% certainty that “cat” is not present in
the set.
• Now let’s say we want to check the existence of “gaming” in our set.
• We pass it through the same hash functions and get the following results.
h1(“gaming”) = 235
h2(“gaming”) = 60
h3(“gaming”) = 22
• We check if the indexes {0, 2, 5} are all set to 1.
• We can see that all of these indexes are set to 1.
• However, we know that “gaming” is not present in the set.
• So, this is a false positive.
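The whole example can be replayed in a few lines of Python. The hash outputs below are the illustrative values given in the slides, not the output of real hash functions; the names `SLIDE_HASHES`, `indices` and `lookup` are chosen for this sketch.

```python
M = 10  # size of the Bloom filter in the example

# Illustrative hash outputs taken from the slides (h1, h2, h3 for each word).
SLIDE_HASHES = {
    "coding": [125, 67, 19],
    "music": [290, 145, 2],
    "cat": [233, 155, 9],
    "gaming": [235, 60, 22],
}


def indices(item):
    """Reduce each hash output modulo M to get positions in the bit array."""
    return sorted({h % M for h in SLIDE_HASHES[item]})


bits = [0] * M
for word in ("coding", "music"):  # only these two words are inserted
    for i in indices(word):
        bits[i] = 1


def lookup(item):
    """True means 'might be present'; False means 'definitely not present'."""
    return all(bits[i] == 1 for i in indices(item))


print(bits)              # [1, 0, 1, 0, 0, 1, 0, 1, 0, 1]
print(indices("cat"))    # [3, 5, 9]
print(lookup("cat"))     # False: index 3 is not set, so a true negative
print(lookup("gaming"))  # True: a false positive, "gaming" was never inserted
```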
Applications of Bloom Filter
• Weak password detection
• Internet Cache Protocol
• Safe browsing in Google Chrome
• Wallet synchronization in Bitcoin
• Hash based IP Traceback
• Cybersecurity tasks such as virus scanning, worm detection, DDoS prevention
and risky URL detection
• Determining whether a user ID or domain is already taken
• Filtering out previously shown posts on recommendation engines
• Checking words for misspellings and profanity with a spellchecker
• Identifying malicious URLs, blocked IPs, and fraudulent transactions
• Databases: Many popular databases use Bloom filters to reduce
costly disk lookups for non-existent rows or columns. This technique
is used by PostgreSQL, Apache Cassandra, Cloud Bigtable, etc.
Advantages of Bloom filter:
1.It uses constant space, regardless of the number of elements inserted.
2.No false negatives, so you can trust the Bloom filter when it says the
item does not exist.
3.Adding an element never fails.
4.It does not store the actual elements, which gives a degree of privacy out of the box.
Disadvantages of Bloom filter:
1.It can return false positives, so you can’t always trust the Bloom filter
when it says the element exists.
2.Adding elements never fails, but at the cost of an ever-increasing false
positive rate.
3.Reducing false-positive rates requires an additional bit array or
recreation of the Bloom filter.
4.Cannot retrieve the inserted elements.
5.Cannot delete the inserted elements.
THANK YOU ☺