Module 3A: Mining Big Data Streams
Motivating Examples
Achieve real-time customer intelligence
• For many enterprises, high-performance clickstream processing is a vital business function.
• From websites to mobile devices, we need to capture and immediately process customer interactions.
• This results in real-time business intelligence, reporting, personalization, and dynamic pricing.
Process events from IoT devices
• Smart factories, connected cars, smart cities, and other IoT devices generate large volumes of data continuously.
• A stream processing solution captures IoT data streams and processes them in a centralized way.
Immediately detect & prevent suspicious activity
• In any type of financial-transaction fraud or ad fraud, streaming is necessary to detect and automatically react to suspicious activity in real time.
• We can combine state-of-the-art machine-learning anomaly-detection algorithms with high-performance stream processing engines to prevent fraud.
Data Streams (Streaming Data)
Streaming data refers to data that is continuously generated, usually in high volumes and at high velocity. A streaming data source would typically consist of a stream of logs that record events as they happen: a user clicking on a link in a web page, or a sensor reporting the current temperature.

A data stream is a constant flow of data, which updates with high frequency and loses its relevance in a short time.

Common examples of streaming data include:
• IoT sensors
• Server and security logs
• Real-time advertising
• Click-stream data from apps and websites
Streams – A New Model
Traditional DBMS: data stored in finite, persistent data sets.
Data streams: distributed, continuous, unbounded, rapid, time-varying, noisy, ...
Data-stream management arises in a variety of modern applications:
– Network monitoring and traffic engineering
– Sensor networks
– Telecom call-detail records
– Network security
– Financial applications
– Manufacturing processes
– Web logs and clickstreams
– Other massive data sets…
Typical Applications
• Heavy machinery/transportation/fleet operations: sourcing data streams from sensors and IoT devices
• Healthcare: real-time monitoring of health conditions, clinical risk assessment, client-state analysis, and alerts
• Finance: transaction processing, market/currency state monitoring
• Retail/customer service: customer behavior analysis and operations improvement
• Manufacturing/supply chain: real-time monitoring, predictive maintenance, disruption/risk assessment
• Home security: IoT data stream analysis, smart protection, and alert-system improvement
• Security: CCTV footage
Uber: Chaperone Tool (Kafka)
An international ride-hailing and food-delivery service, with two real-time use cases:
• Tracking the location of drivers and clients requires a constant data flow and geolocation updates, pushing this data to both types of application users. This means that Uber has to deal with petabytes of messages to keep track of the data flow.
• A constant financial flow comes from Uber users who make payments directly through the application, and it requires monitoring. Financial operations carry a high risk of fraud, so in addition to controlling the volume of streamed data, Uber also has to be on the alert for fraud.
Human resources
By applying streaming analytics to data streams,
such as email, time reporting apps, injury
reports, and other resources, managers can gain
deeper insights into behavioral patterns that may
suggest an employee is burning out due to
excessive hours or actively interviewing at other
companies.
The insights can help HR professionals and line-
of-business managers proactively balance
workloads, offer more competitive
compensation, or provide training and
development to retain valued team members.
DBMS vs. DSMS #1
[Diagram: a DBMS performs query processing against data on disk; a DSMS performs query processing against data in main memory.]
DBMS vs. DSMS #2
Traditional DBMS:
• stored sets of relatively static records with no pre-defined notion of time
• good for applications that require persistent data storage and complex querying

DSMS:
• supports on-line analysis of rapidly changing data streams
• a data stream is a real-time, continuous, ordered (implicitly by arrival time or explicitly by timestamp) sequence of items, too large to store entirely and never ending
• continuous queries
DBMS vs. DSMS #3
DBMS                                             | DSMS
-------------------------------------------------|-------------------------------------------
Persistent relations (relatively static, stored) | Transient streams (on-line analysis)
One-time queries                                 | Continuous queries
Random access                                    | Sequential access
"Unbounded" disk store                           | Bounded main memory
Only current state matters                       | Historical data is important
No real-time services                            | Real-time requirements
Relatively low update rate                       | Multi-GB arrival rate
Data at any granularity                          | Data at fine granularity
Assume precise data                              | Data stale/imprecise
Access plan determined by query processor and physical DB design | Unpredictable/variable data arrival and characteristics
General Stream Processing Model
[Figure: streams entering a stream processor. Each stream is composed of elements/tuples arriving over time (e.g., ... 1, 5, 2, 7, 0, 9, 3 and ... a, r, v, t, y, h, b and ... 0, 0, 1, 0, 1, 1, 0). The processor serves standing queries and ad-hoc queries, emits output, and uses limited working storage backed by archival storage.]
Data Streams Mining
Definition
Continuous, unbounded, rapid, time-varying
streams of data elements
Application Characteristics
Massive volumes of data (can be several
terabytes)
Records arrive at a rapid rate
Goal
Mine patterns, process queries and compute
statistics on data streams in real-time
Streaming Analytics
Big data streaming is a process in which large streams of
real-time data are processed with the sole aim of
extracting insights and useful trends out of them.
A continuous stream of unstructured data is sent for
analysis into memory before storing it onto disk.
This happens across a cluster of servers.
Speed matters the most in big data streaming. The value
of data, if not processed quickly, decreases with time.
Real-time streaming data analysis is a single-pass
analysis. Analysts cannot choose to reanalyze the data
once it is streamed.
Data stream Mining – Challenges
Mining big data streams involves not only Volume,
Velocity, and Variety, but also Volatility.
Volatility - dynamic environment with ever-changing
patterns.
Concept drift is a phenomenon that occurs when
the distributions of features x and target variables y
change in time.
As data streams have no beginning or end, they can’t
be broken into batches. So there is no time when the
data can be uploaded into storage and processed.
Instead, data streams are processed on the fly.
Challenges
Single pass: each record is examined at most once
(random access not possible)
Bounded storage: limited memory for storing synopsis
Real-time: per record processing time (to maintain
synopsis) must be low
As new data arrives, older data is discarded to make room
for subsequent examples.
The algorithm processing the stream has no control over the
order of the examples seen, and must update its model
incrementally as each example is inspected.
An additional desirable property, the so-called anytime
property, requires that the model is ready to be applied at any
point between training examples
Generally, algorithms compute approximate answers
Applications (1)
Mining query streams
Google wants to know what queries are
more frequent today than yesterday
Applications (2)
Sensor Networks
Many sensors feeding into a central controller
Telephone call records
Data feeds into customer bills as well as
settlements between telephone companies
IP packets monitored at a switch
Gather information for optimal routing
Detect denial-of-service attacks
Interesting Use Cases
Customer experience and interaction
When a customer visits a retail website, their
website movements get tracked, as well as
their purchasing preferences and choices.
Recognizing and responding to consumer
buying patterns in real time can be integrated
into marketing so a customer looking at slacks
could suddenly see an offer for matching
shirts.
Interesting Use Cases
Environmental sensing
Ambient temperatures in data centers, conference
rooms, offices, warehouses, refrigeration units,
hospital operating rooms, etc., are all part of
providing viable work space.
If a sensor in a room suddenly detects temperatures
or humidities that are falling outside of range, an
auto-alert can be sent, and a maintenance person
can be dispatched.
The technology can save lives, prevent food spoilage,
and keep data centers running.
Interesting Use Cases
Systems geo-tracking and security
Who's tapping into your systems and
networks, when and from where are all
important elements of security and
governance.
So is the ability to track foot traffic through
plants and offices.
IBM reported that in 2020 the cost of a single data
breach was $3.8 million, so the savings in cost and
company reputation are significant.
Interesting Use Cases
Industrial IoT
A machine failure on an assembly line can cost $1 million per day.
That failure can be prevented by an industrial sensor that can detect a
machine failing in real time before it happens.
This kind of preemptive maintenance keeps assembly lines running and
saves millions of dollars.
Logistics
Logistics companies track trucks and cars on the road with IoT sensors.
They are able to see which vehicles will arrive on time, or ahead of, or
behind schedule.
They can observe vehicle proximity and reroute a vehicle if another vehicle
in the area suffers a breakdown.
All of this is facilitated with IoT devices and sensors attached to vehicles
that are monitored in real time.
The savings can mount up. For refrigerated trucks alone, the
late fee for one load of cargo can be $500.
Interesting Use Cases
Patient monitoring
Healthcare clinics and hospitals can now
automatically receive vitals readings from patients
who are at home.
Alerts are issued if a patient's data indicates a
dangerous condition.
Fraud detection
In a flash, a bank card processor can detect a
fraudulent credit card transaction as soon as the
perpetrator passes the card through a card reader.
The transaction gets denied, and no money is lost.
Overview Data Stream Processing
Data Stream Queries -- Types
Answer availability
One-time
Multiple-time
Continuous (“standing”), stored or
streamed
Join queries
Registration time
Predefined
Ad hoc
Stream Queries Issues - 1
Unbounded memory requirements
Approximate query answering:
  data reduction and synopsis construction
  (sketches, random sampling, histograms & wavelets)
Sliding windows: storing only recent elements

SELECT AVG(S.minutes)
FROM Calls S [PARTITION BY S.customer_id ROWS 10 PRECEDING]
WHERE S.type = 'Long Distance'
Stream Queries Issues - 2
Join queries
Return the average length of the last 1000 telephone calls
placed by "Gold" customers:

SELECT AVG(V.minutes)
FROM (SELECT S.minutes
      FROM Calls S, Customers T
      WHERE S.customer_id = T.customer_id
        AND T.tier = 'Gold')
     V [ROWS 1000 PRECEDING]
Stream Query Processing Issues
Data Streams Operations
• Sampling data from a stream
• Filtering a data stream: select elements with property x from the stream
• Counting distinct elements: number of distinct elements in the last k elements of the stream
• Counting the number of 1s in a window
Sampling
Sampling is a common practice for
selecting a subset of data to be analyzed.
We select instances at periodic intervals.
Sampling is used to compute statistics
(expected values) of the stream.
The main problem is to obtain a representative sample:
a subset of the data that has approximately the same
properties as the original data.
Sampling from a Data Stream
Since we cannot store the entire stream, one obvious approach is to store a sample.
Two different problems:
(1) Sample a fixed proportion of elements in the stream (say 1 in 10)
(2) Maintain a random sample of fixed size over a potentially infinite stream: at any "time" k we would like a random sample of s elements
What property of the sample do we want to maintain? For all time steps k, each of the k elements seen so far has equal probability of being sampled.
Maintaining a Fixed-Size Sample
Suppose we need to maintain a random sample S of size exactly s tuples, e.g., because of a main-memory size constraint. Why is this non-trivial? We don't know the length of the stream in advance.
Suppose at time n we have seen n items. Each item should be in the sample S with equal probability s/n.
How to think about the problem: say s = 2
Stream: a x c y z k c d e g ...
At n = 5, each of the first 5 tuples is included in the sample S with equal probability.
At n = 7, each of the first 7 tuples is included in the sample S with equal probability.
An impractical solution would be to store all n tuples seen so far and pick s of them at random.
Solution: Fixed-Size Sample
Reservoir sampling:
• Store all the first s elements of the stream in S.
• Suppose we have seen n-1 elements, and now the nth element arrives (n > s):
  - With probability s/n, keep the nth element, else discard it.
  - If we picked the nth element, it replaces one of the s elements in the sample S, picked uniformly at random.
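The reservoir sampling steps above can be sketched in a few lines of Python (an illustrative sketch; the stream can be any iterable):

```python
import random

def reservoir_sample(stream, s, rng=random):
    """Maintain a uniform random sample of exactly s elements of a stream."""
    sample = []
    for n, item in enumerate(stream, start=1):
        if n <= s:
            sample.append(item)              # store the first s elements
        elif rng.random() < s / n:           # keep the nth element with prob. s/n
            sample[rng.randrange(s)] = item  # replace a uniformly chosen slot
    return sample
```

After n elements, every element has probability s/n of being in the sample, which matches the invariant stated above.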
[Figure: decision tree for the reservoir sampling algorithm on the stream a1, a2, a3, a4.]
Reservoir Sampling - Issues

Possible Solutions
Biased Reservoir Sampling
A bias function regulates the sampling from the stream.
This bias gives a higher probability of selecting data points from recent parts of the stream compared to the distant past.
This bias function is quite effective, since it regulates the sampling in a smooth way, so that queries over recent horizons are resolved more accurately.
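A minimal sketch of one memory-less way to realize such a bias (the capacity-driven replacement rule below is one concrete choice in the style of exponential-bias reservoirs; the details are illustrative, not prescribed by the slides):

```python
import random

def biased_reservoir_sample(stream, capacity, rng=random):
    """Recency-biased reservoir: every arriving element is inserted; with
    probability equal to the current fill fraction it overwrites a random
    slot, so older elements decay roughly exponentially over time."""
    reservoir = []
    for item in stream:
        fill = len(reservoir) / capacity
        if rng.random() < fill:
            reservoir[rng.randrange(len(reservoir))] = item  # overwrite an old point
        else:
            reservoir.append(item)  # reservoir still has room to grow
    return reservoir
```

Note that the newest element is always present, and the reservoir never grows beyond `capacity`.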
Concise Sampling
Note that the size of the reservoir is sometimes
restricted by the available main memory.
The method of concise sampling exploits the fact that
the number of distinct values of an attribute is often
significantly smaller than the size of the data stream.
The sample is maintained as a set S of <value, count>
pairs.
For those pairs in which the value of count is one, we
do not maintain the count explicitly, but we maintain
the value as a singleton.
The number of elements in this representation is referred to as the footprint.
Concise Sampling
• Duplicates in the sample S are stored as <value, count> pairs (thus potentially boosting the actual sample size).
• Add each new element to S with probability 1/T (simply increment the count if the element is already in S).
• If the sample footprint exceeds M:
  - Select a new threshold T' > T.
  - Evict each element (decrementing counts) from S with probability 1 - T/T'.
  - Add subsequent elements to S with probability 1/T'.
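A minimal sketch of the scheme above (doubling T is one arbitrary choice of T' > T):

```python
import random

def footprint(S):
    """Singleton values cost 1; <value, count> pairs cost 2."""
    return sum(1 if c == 1 else 2 for c in S.values())

def concise_sample(stream, M, T0=1.0, rng=random):
    """Concise sampling: the sample is a {value: count} dict whose
    footprint is capped at M; elements enter with probability 1/T."""
    S, T = {}, T0
    for x in stream:
        if rng.random() < 1.0 / T:
            S[x] = S.get(x, 0) + 1   # increment if present, else add a singleton
        while footprint(S) > M:      # footprint overflow: raise the threshold
            T_new = 2 * T            # any T' > T works
            for v in list(S):
                # each stored copy survives with probability T/T'
                kept = sum(1 for _ in range(S[v]) if rng.random() < T / T_new)
                if kept:
                    S[v] = kept
                else:
                    del S[v]
            T = T_new
    return S
```

With the initial threshold T = 1 every element is admitted, so small streams are stored exactly until the footprint cap forces evictions.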
Concise-Sampling Example
Dataset D = { a, a, a, b, b, b }
Footprint F = one <value, count> pair
Three (possible) samples of size 3:
S1 = { a, a, a }, S2 = { b, b, b }, S3 = { a, a, b }
i.e., S1 = {<a,3>}, S2 = {<b,3>}, S3 = {<a,2>,<b,1>}
The three samples should occur with equal likelihood, but Prob(S1) = Prob(S2) > 0 while Prob(S3) = 0.
In general: concise sampling under-represents 'rare' population elements.
Filtering data
Review: Bloom Filters
Bloom Filter
A space-efficient probabilistic data
structure, conceived
by Burton Howard Bloom, used to test membership
False positive possible, but no false negatives
(100% recall rate).
Query returns "possibly in set" or "definitely not in
set".
Elements can be added to the set, but not removed
The more elements that are added to the set, the larger
the probability of false positives.
A Bloom filter uses considerably less space than any exact
method, but pays for this by introducing a small
probability of error.
First Example Scenario
Suppose you are creating an account on Goodreads. You want to enter a cool username; you entered it and got the message "Username is already taken". You added your birth date along with the username, still no luck. Then you added your university roll number as well, and still got "Username is already taken". It's really frustrating, isn't it?
But have you ever thought about how quickly Goodreads checks the availability of a username by searching the millions of usernames registered with it?
Linear search: bad idea!
Binary search: way too much work (sorting, etc.)!
Scenario Contd.
Bloom Filter is a data structure that can do this job.
What is Bloom Filter?
A Bloom filter is a space-efficient probabilistic data
structure that is used to test whether an element is a
member of a set.
For example, checking the availability of a username is a set-membership problem, where the set is the list of all registered usernames.
The price we pay for efficiency is that the filter is probabilistic in nature, which means there might be some false-positive results.
A false positive means the filter might say that a given username is already taken when actually it is not.
Example - Google Chrome
Chrome needs to store a blacklist of dangerous URLs.
Each time a user is about to navigate to new page, need
to check against the blacklist
The size of the blacklist is around a million URLs, each between 2 and 2083 characters long.
Thus, we may need disk accesses to determine whether the current URL is to be allowed or not.
Considering the large number of users continuously accessing websites through Chrome, this is indeed a data stream with very high velocity.
The algorithm will use main memory only, and yet will filter out most of the undesired stream elements.
Motivating Example
Let us assume we want to use about one megabyte of
available main memory to store the blacklist.
The Bloom filter uses the main memory as a bit array able to store eight million bits.
A hash function h maps each URL in the blacklist to one of the eight million buckets, and the corresponding bit in the bit array is set to 1. All other bits of the array remain 0.
It is possible that two URLs hash to the same bit.
When a stream element arrives, we hash its URL. If the bit to which it hashes is 1, then we need to check further whether this URL is safe for browsing or not.
But if the URL hashes to a 0, then the address is definitely not in the blacklist, so we can pass this stream element without further checks.
Bloom Filters
• Bloom filters compactly encode set membership.
• k hash functions map each item into a bit vector, k positions per item.
• All k entries are set to 1 to indicate the item is present.
• Items can be looked up; a set of size n can be stored in ~2n bits.
[Figure: an item hashed into a bit vector by multiple hash functions, setting the corresponding bits to 1.]
Bloom Filter Overview
Both insertions and membership queries should be performed in constant time.
A Bloom filter is a bit vector B of m bits, with k independent hash functions that map each key in U to the set {0, 1, ..., m-1}.
We assume that each hash function maps a uniformly-at-random chosen key to each element of {0, 1, ..., m-1} with equal probability.
Since we assume the hash functions are independent, it follows that the vector (h1(x), ..., hk(x)) is equally likely to be any of the k-tuples of elements from {0, 1, ..., m-1}.
Algorithm
Initially all m bits of B are set to 0.
Insert x into S:
  Compute h1(x), h2(x), ..., hk(x).
  Set B[h1(x)] = B[h2(x)] = ... = B[hk(x)] = 1.
Query whether x ∈ S:
  Compute h1(x), h2(x), ..., hk(x).
  If B[h1(x)] = B[h2(x)] = ... = B[hk(x)] = 1, then answer Yes, else answer No.
Bloom Filters
Start with an m-bit array, filled with 0s.
[Figure: the bit array before and after inserting items; each insertion sets the hashed positions to 1.]
Three parameters:
• Size m/n: bits per item
• Time k: number of hash functions
• Error f: false-positive probability
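The three parameters are linked by the standard approximation f ≈ (1 - e^(-kn/m))^k for n inserted items, which is minimized by choosing k ≈ (m/n)·ln 2. A quick sketch for exploring the trade-off:

```python
import math

def false_positive_rate(m, n, k):
    """Approximate false-positive probability of a Bloom filter with
    m bits, n inserted items, and k hash functions."""
    return (1.0 - math.exp(-k * n / m)) ** k

def optimal_k(m, n):
    """Number of hash functions that minimizes the false-positive rate."""
    return max(1, round(m / n * math.log(2)))
```

For example, with 8 bits per item (m/n = 8) the optimal k is about 6 and the false-positive rate is roughly 2%.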
Overview
At the heart of every Bloom filter lie two key elements:
An array of n bits, initially all set to 0.
A collection of k independent hash functions h(x). Each hash function
takes a value v and generates a number i where i < n which effectively
maps to a position in the bit array.
The underlying idea of a bloom filter is quite simple
Initialize bit array of n bits with 0s. n >> no. of elements in the set.
Whenever the filter sees a new element apply each of the hash
functions h(x) on the element. With the value generated, which is an
index in the bit array, set the bit to 1 in the array.
If there are k hash functions there will be k indices generated. For each
of these k positions in the bit array set array[i] = 1
To check whether an element is in the set, we follow the same procedure with a twist:
generate k values by applying the k hash functions to the input. If at least one of
these k indices in the bit array is set to zero, the element is definitely new;
otherwise it is (probably) an existing element of the set.
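The steps above can be sketched as a small class (salted SHA-256 stands in for the k independent hash functions; that choice is illustrative, not prescribed by the slides):

```python
import hashlib

class BloomFilter:
    """m-bit Bloom filter with k hash functions derived from salted SHA-256."""

    def __init__(self, m, k):
        self.m, self.k = m, k
        self.bits = [0] * m

    def _indices(self, item):
        for salt in range(self.k):
            digest = hashlib.sha256(f"{salt}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.m

    def add(self, item):
        for i in self._indices(item):
            self.bits[i] = 1

    def __contains__(self, item):
        # all probed bits 1 -> "possibly in set"; any 0 -> "definitely not"
        return all(self.bits[i] for i in self._indices(item))
```

Inserted items are always reported as present (no false negatives), while a small fraction of absent items may collide on all k bits and be reported as present (false positives).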
Interesting Properties of Bloom
Filters
Unlike a standard hash table, a Bloom filter of a fixed size can
represent a set with an arbitrarily large number of elements.
Adding an element never fails. However, the false positive
rate increases steadily as elements are added until all bits in
the filter are set to 1, at which point all queries yield a
positive result.
Bloom filters never generate a false-negative result, i.e., they will never tell you that a username doesn't exist when it actually does.
Deleting elements from the filter is not possible: if we deleted a single element by clearing the bits at the indices generated by its k hash functions, we might cause the deletion of a few other elements.
Some More Examples
The servers of Akamai Technologies, a content delivery provider,
use Bloom filters to prevent "one-hit-wonders" from being stored
in its disk caches.
One-hit-wonders are web objects requested by users just once,
something that Akamai found applied to nearly three-quarters of
their caching infrastructure.
Using a Bloom filter to detect the second request for a web object
and caching that object only on its second request prevents one-
hit wonders from entering the disk cache, significantly reducing
disk workload and increasing disk cache hit rates.
Google Bigtable, Apache HBase, Apache Cassandra, and PostgreSQL use Bloom filters to reduce disk lookups for non-existent rows or columns. Avoiding costly disk lookups considerably increases the performance of database query operations.
Some Examples
The Squid Web Proxy Cache uses Bloom filters for cache digests.
Bitcoin uses Bloom filters to speed up wallet synchronization.
The Venti archival storage system uses Bloom filters to detect
previously stored data.
The SPIN model checker uses Bloom filters to track the
reachable state space for large verification problems.
The Cascading analytics framework uses Bloom filters to speed
up asymmetric joins, where one of the joined data sets is
significantly larger than the other (often called Bloom join in the
database literature).
Prime Video uses Bloom filters effectively to avoid duplicate
recommendations
Medium uses Bloom filters to avoid recommending articles a
user has previously read.
Ethereum uses Bloom filters for quickly finding logs on the
Ethereum blockchain
Examples
Joins on distributed relations
Spell Check
Weak Password Dictionary - Store dictionary of easily
guessable passwords as bloom filter, query when users pick
passwords.
Virus Signature detection
Inventory checks
Any unique-identification system has to generate a unique number for each newly registered user. If the number of user registrations increases dramatically, checking against the database becomes too expensive. In this case, a Bloom filter can tell whether a number has already been generated. If yes, simply generate a new random number and check with the filter again; keep doing this until the Bloom filter returns false.
Popular hash algorithms
DJB2
DJB2a (variant using xor rather than +)
FNV-1 (32-bit)
FNV-1a (32-bit)
SDBM
CRC32
Murmur2 (32-bit)
SuperFastHash
Example
In this section, we present an example using Bloom filters.
We assume an array of 10 bits, all initially set to 0, and two simple hash functions:
1) h1(x) = x mod 10
2) h2(x) = (5x + 4) mod 10

Position: 0 1 2 3 4 5 6 7 8 9
Bit:      0 0 0 0 0 0 0 0 0 0
Insertion
To insert 19 into the Bloom filter, we compute the digests of h1, h2:
h1(19) = 19 mod 10 = 9
h2(19) = (5*19 + 4) mod 10 = 99 mod 10 = 9
Then the bit in position 9 of the filter is set to 1. After the insertion of 19 the filter is:

Position: 0 1 2 3 4 5 6 7 8 9
Bit:      0 0 0 0 0 0 0 0 0 1
Insert
Similarly, to insert 132 we compute:
h1(132) = 2
h2(132) = 4
Then the bits in positions 2 and 4 are set to 1. The filter now is:

Position: 0 1 2 3 4 5 6 7 8 9
Bit:      0 0 1 0 1 0 0 0 0 1
Insert
Finally, regarding the insertion of 25, the digests of the hash functions are:
h1(25) = 5
h2(25) = 9
Bit 5 is set to 1. Bit 9 is already 1, as it was set by the insertion of element 19. The Bloom filter after the insertion of 25 is:

Position: 0 1 2 3 4 5 6 7 8 9
Bit:      0 0 1 0 1 1 0 0 0 1
Existence Checks
Now, we check using the previously formed Bloom filter for the
existence of the elements 133, 25 and 24 in the set A.
To check if element 133 exists in A, we first compute the digests of
h1, h2: h1(133) = 3 h2(133) = 9
Then we check whether the bits of positions 3 and 9 of the Bloom
filter are set to 1. Although bit 9 is set, the bit in position 3 is 0. As a
result, the filter returns NO.
To check if element 25 exists in A, we compute h1(25) = 5 and
h2(25) = 9. Then we check whether the bits of positions 5 and 9 of
the Bloom filter are set to 1. Indeed, both bits are 1. So, Bloom filter
returns YES. It is a true positive, as 25 exists in the set.
To check if element 24 exists in A, we compute h1(24) = 24 mod 10 = 4 and h2(24) = 124 mod 10 = 4. Then we check whether the bit in position 4 is set to 1. Indeed, it is. Although the Bloom filter again returns YES, element 24 does not exist in the set, resulting in a false positive.
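The whole worked example can be reproduced directly with the two given hash functions:

```python
def h1(x):
    return x % 10

def h2(x):
    return (5 * x + 4) % 10

bits = [0] * 10  # the 10-bit Bloom filter, initially all zeros

def insert(x):
    bits[h1(x)] = bits[h2(x)] = 1

def query(x):
    return bits[h1(x)] == bits[h2(x)] == 1

for x in (19, 132, 25):
    insert(x)

print(bits)        # [0, 0, 1, 0, 1, 1, 0, 0, 0, 1]: bits 2, 4, 5, 9 are set
print(query(133))  # False: bit 3 is 0, so 133 is definitely not in the set
print(query(25))   # True: a true positive
print(query(24))   # True, although 24 was never inserted: a false positive
```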
Sketches
• Not every problem can be solved with sampling. Example: counting how many distinct items are in the stream. If a large fraction of items aren't sampled, we don't know if they are all the same or all different.
• Other techniques take advantage of the fact that the algorithm can "see" all the data even if it can't "remember" it all.
• "Sketch": essentially, a linear transform (linear projection) of the input. Model the stream as defining a vector; the sketch is the result of multiplying the stream vector by an (implicit) matrix.
Counting Distinct Elements in a Stream

Problem description:
Given a data stream of n insertions of records, count the number F0 of distinct records, in one pass over the data stream.
Algorithms must use a small amount of memory and have fast update time; it is too expensive to store the set of distinct records.
This implies that algorithms must be randomized and must settle for an approximate solution: output F ∈ [(1-ε)F0, (1+ε)F0] with constant probability.
Some Applications
How many different words are found among
the Web pages being crawled at a site?
Unusually low or high numbers could indicate
artificial pages (spam?).
How many different Web pages does each
customer request in a week?
Simple Solution
• Keep an array a[0..U], initially all 0, and a counter C initialized to 0.
• Every time an item i arrives, look at a[i]. If it is zero, increment C and set a[i] = 1.
• Return C as the number of distinct items.
Time: O(1) per update and per query, but space is O(U).
What happens if we do not have enough memory to store all the distinct items? The Flajolet-Martin sketch.
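The array-based solution translates directly into Python (illustrative only; U is the size of the item universe):

```python
def count_distinct_exact(stream, U):
    """Exact distinct count: O(1) per update, but O(U) space."""
    a = [0] * (U + 1)  # one flag per possible item value
    C = 0
    for i in stream:
        if a[i] == 0:  # first time we see item i
            a[i] = 1
            C += 1
    return C
```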
Using Small Storage
Real Problem: what if we do not
have space to store the complete
set?
Estimate the count in an unbiased
way.
Accept that the count may be in
error, but limit the probability that
the error is large.
Challenge
Example (N = 64)
Data stream: 3 2 5 3 2 1 7 5 1 2 3 7 5
Number of distinct values: 5

Stream: 8 2 1 9 1 9 2 4 4 9 4 2 5 4 2 5 8 5 2 5
Vector x, indexed 1..9: initially x = 0; an insertion of i is interpreted as x_i = x_i + 1.
We want to estimate DE(x), the number of non-zero elements of x.
The Flajolet-Martin Algorithm
The Flajolet-Martin algorithm uses the positions of the rightmost set and unset bits to approximate the count of distinct elements in a given stream. The two seemingly unrelated concepts are intertwined using probability.
It uses extra storage of order O(log m), where m is the number of unique elements in the stream, and provides a practical estimate of the cardinalities.
Basic Idea of FM
• Keep an array a[1 .. log U].
• Use a hash function f : {1..U} → {0 .. log U}.
• Compute f(i) for every item in the stream, and set a[f(i)] = 1.
• Somehow extract from this the approximate number of distinct items.
Space requirement: O(log U) = O(log N), assuming the hash functions do not require too much space.
Flajolet-Martin Approach - Intuition
Intuition
The probability that the rightmost set bit is at a given position drops by a factor of 1/2 with every position from the LSB to the MSB.
So if we keep recording the position ρ of the rightmost set bit for every element in the stream (assuming a uniform distribution), we should expect P(ρ = 0) to be 0.5, P(ρ = 1) to be 0.25, and so on. This probability should become 0 for bit positions b > log m, while it should be non-zero for b <= log m, where m is the number of distinct elements in the stream.
Hence, if we find the rightmost unset bit position b such that the probability is 0, we can say that the number of unique elements is approximately 2^b. This forms the core intuition behind the Flajolet-Martin algorithm.
Flajolet-Martin Approach
• Pick a hash function h that maps each of the n elements to at least log2 n bits.
• For each stream element a, let r(a) be the number of trailing 0s in h(a).
• Record R = the maximum r(a) seen.
• Estimate = 2^R.
Simplified
• Create a bit vector (bit array) of sufficient length L, such that 2^L > n, the number of elements in the stream. Usually a 64-bit vector is sufficient, since 2^64 is quite large for most purposes.
• The i-th bit in this vector/array represents whether we have seen a hash-function value whose binary representation ends in 0^i. So initialize each bit to 0.
• Generate a good, random hash function that maps input (usually strings) to natural numbers.
• Read the input. For each word, hash it and determine the number of trailing zeros. If the number of trailing zeros is k, set the k-th bit in the bit vector to 1.
Simplified
• Once the input is exhausted, get the index of the first 0 in the bit array (call this R). By the way, this is just the number of consecutive 1s (i.e., we have seen 0, 00, ..., as the output of the hash function) plus one.
• Calculate the number of unique words as 2^R * Constant.
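The simplified procedure can be sketched as follows (SHA-256 is an illustrative stand-in for the random hash function, and the correction constant is omitted, so the raw 2^R value is returned):

```python
import hashlib

def trailing_zeros(x, width=64):
    """Number of trailing zero bits of x (width if x == 0)."""
    if x == 0:
        return width
    return (x & -x).bit_length() - 1

def fm_estimate(stream, width=64):
    """Flajolet-Martin estimate: mark bit k when a hash value with k
    trailing zeros is seen; R is the index of the first unmarked bit."""
    seen = [0] * width
    for item in stream:
        h = int(hashlib.sha256(str(item).encode()).hexdigest(), 16)
        seen[trailing_zeros(h, width)] = 1
    R = seen.index(0)  # first position never marked
    return 2 ** R
```

Because the estimate is always a power of two, practical deployments average or combine several such estimates, as discussed later.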
Simple Explanation
• We start by defining a closed hash range big enough to hold the maximum number of unique values possible, something as big as 2^64.
• Every element of the stream is passed through a hash function that permutes the elements in a uniform distribution.
• For this hash value, we find the position of the rightmost set bit and mark the corresponding position in the bit vector as 1, indicating that we have seen that position.
• Once all the elements are processed, the bit vector will have 1s at all the positions corresponding to the rightmost set bit of every element in the stream.
• Now we find the position b of the rightmost 0 in this bit vector. This position corresponds to a rightmost set bit that we have not seen while processing the elements.
• It corresponds to probability 0 and hence, as per the intuition, helps approximate the cardinality as 2^b.
FM Algorithm
• Use r hash functions to create r FM sketches.
• Initialize each FM sketch to zero.
• For each record x in the dataset, and for each hash function hi(x), set FMi[pivot] = 1.
[Figure: an example sketch FM1 = 1 0 1 0 with B1 = 1; the per-sketch values are averaged, e.g., B = (1 + 2 + 2)/3 = 1.67.]
Example: r(x) = number of trailing zeros in the L-bit binary representation of x (here L = 4).

x   binary   r(x)
0   0000     4 (= L)
1   0001     0
2   0010     1
3   0011     0
4   0100     2
5   0101     0
6   0110     1
7   0111     0
8   1000     3

E.g., x = 10 = (1010)2: bit(y,0) = 0, bit(y,1) = 1, bit(y,2) = 0, bit(y,3) = 1.
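The r(x) column can be checked with a throwaway helper:

```python
def r(x, L=4):
    """Number of trailing zeros in the L-bit binary representation of x
    (defined as L when x == 0)."""
    if x == 0:
        return L
    count = 0
    while x % 2 == 0:
        x //= 2
        count += 1
    return count

print([r(x) for x in range(9)])  # [4, 0, 1, 0, 2, 0, 1, 0, 3]
```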
Flajolet-Martin Approach – Estimate
Example
Variations of the F-M Algorithm
• Take the mean of the k results from the hash functions, obtaining a single estimate of the cardinality.
• A different idea is to use the median, which is less prone to being influenced by outliers. Another problem is that the result can then only take the form of some power of 2.
• A common solution is to combine both the mean and the median:
  - Create k·ℓ hash functions and split them into k distinct groups (each of size ℓ).
  - Within each group, use the median to aggregate the ℓ results.
  - Finally, take the mean of the k group estimates as the final estimate.
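The grouping scheme above can be sketched as a small helper that combines k·ℓ per-hash-function estimates (the estimates themselves would come from any single-sketch estimator; the function name is our own):

```python
from statistics import mean, median

def combine_estimates(estimates, k, l):
    """Split k*l per-hash estimates into k groups of l, take the median
    within each group, then the mean of the k group medians."""
    assert len(estimates) == k * l
    groups = [estimates[i * l:(i + 1) * l] for i in range(k)]
    return mean(median(g) for g in groups)
```

For example, combine_estimates([1, 2, 100, 4, 8, 8], k=2, l=3) takes the group medians 2 and 8 and returns their mean, 5; the outlier 100 has no effect, and the result is no longer constrained to a power of 2.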
Space Requirement
As we read the stream it is not necessary to store the
elements seen.
The only thing we need to keep in main memory is one
integer per hash function; this integer records the largest tail
length seen so far for that hash function and any stream
element.
If we are processing only one stream, we could use millions
of hash functions, which is far more than we need to get a
close estimate.
Only if we are trying to process many streams at the same
time would main memory constrain the number of hash
functions we could associate with any one stream.
In practice, the time it takes to compute hash values for each
stream element would be the more significant limitation on
the number of hash functions we use.
108
Applications
Web sites often gather statistics on how many unique
users they have seen in each given month. The universal set
is the set of logins for that site, and a stream element is
generated each time a user logs in.
Amazon: users log in with their unique login names.
Google: identifies users by IP addresses.
Radio-frequency identification (RFID) technology uses
RFID tags and RFID readers (simply called tags and
readers) to monitor objects in the physical world.
Many events (e.g., TEDx) distribute RFID wrist bands to their
visitors. RFID counting helps reveal the number of people
around.
109
Applications
DNA Motifs: Sequence motifs are short,
recurring patterns in DNA that are presumed
to have a biological function.
The number of distinct motifs conveys valuable
biological information about the specific DNA
sequence.
Denial-of-service attacks are signaled by large
numbers of requests from spoofed IPs.
Counting distinct elements provides valuable
statistics in these cases.
110
Duplicate Insensitive
Counting
Distinct-values estimation can also be used as a general tool for
duplicate-insensitive counting:
each item to be counted views its unique id as its "value", so
that the number of distinct values equals the number of items
to be counted.
Duplicate-insensitive counting is useful in mobile computing to
avoid double counting nodes that are in motion.
It can also be used to compute the number of distinct
neighborhoods at a given hop-count from a node and the size of
the transitive closure of a graph.
In a sensor network, duplicate insensitive counting together
with multi-path in-network aggregation enables robust and
energy-efficient answers to count queries.
Moreover, duplicate insensitive counting is a building block for
duplicate-insensitive computation of other aggregates, such as
sum and average.
111
Some Results
Wikipedia article on "United States Constitution" had 3978
unique words. When run ten times, the Flajolet-Martin
algorithm reported values of 4902, 4202, 4202, 4044, 4367,
3602, 4367, 4202, 4202 and 3891, for an average of 4198. As
can be seen, the average is about right, but the deviation
of individual runs ranges from about −400 to +1000.
The algorithm was run on the text dump of The Jungle Book
by Rudyard Kipling. The text was converted into a stream of
tokens and it was found that the total number of unique
tokens was 7150. The approximation of the same using the
Flajolet-Martin algorithm came out to be 7606 which in fact
is pretty close to the actual number.
112
Extra Examples
Stream: 4, 2, 5 ,9, 1, 6, 3, 7
Hash function, h(x) = (ax + b) mod 32
a) h(x) = 3x + 7 mod 32 b) h(x) = x + 6 mod 32
a) h(x) = 3x + 7 mod 32
h(4) = 3(4) + 7 mod 32 = 19 mod 32 = 19 = (10011)
h(2) = 3(2) + 7 mod 32 = 13 mod 32 = 13 = (01101)
h(5) = 3(5) + 7 mod 32 = 22 mod 32 = 22 = (10110)
h(9) = 3(9) + 7 mod 32 = 34 mod 32 = 2 = (00010)
h(1) = 3(1) + 7 mod 32 = 10 mod 32 = 10 = (01010)
h(6) = 3(6) + 7 mod 32 = 25 mod 32 = 25 = (11001)
h(3) = 3(3) + 7 mod 32 = 16 mod 32 = 16 = (10000)
h(7) = 3(7) + 7 mod 32 = 28 mod 32 = 28 = (11100)
Trailing zeros: {0, 0, 1, 1, 1, 0, 4, 2}
R = max[trailing zeros] = 4 → Output = 2^R = 2^4 = 16
113
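The worked example above, part (a), can be checked mechanically; this is only a verification sketch, with helper names of my choosing.

```python
def trailing_zeros(x: int) -> int:
    """Number of trailing zero bits of x; 0 is treated as having none
    here for simplicity."""
    if x == 0:
        return 0
    pos = 0
    while x & 1 == 0:
        x >>= 1
        pos += 1
    return pos

stream = [4, 2, 5, 9, 1, 6, 3, 7]
h = lambda x: (3 * x + 7) % 32          # part (a) hash from the example
tails = [trailing_zeros(h(x)) for x in stream]
print(tails)               # [0, 0, 1, 1, 1, 0, 4, 2]
print(2 ** max(tails))     # R = 4, so the output is 2^4 = 16
```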
Queries over a
(long) Sliding Window
Sliding Windows
A useful model of stream processing is that
queries are about a window of length N –
the N most recent elements received
Interesting case: N is so large that the data cannot
be stored in memory, or even on disk
Or, there are so many streams that windows
for all cannot be stored
Amazon example:
For every product X we keep 0/1 stream of whether
that product was sold in the n-th transaction
We want to answer queries: how many times have we sold
X in the last k sales?
115
Examples
Example: For each spam mail seen we
emitted a 1. Now, we want to always know
how many of the last million emails were
spam
Example: For each tweet seen we emitted a 1
if it is a positive remark. Now, we want to
always know how many of the most recent billion
tweets had positive sentiment
116
Sliding Window: 1
Stream
Sliding window on a single stream: N = 6
q w e r t y u i o p a s d f g h j k l z x c v b n m
(the window always covers the 6 most recent elements and
slides forward as new elements arrive)
Past ←                → Future
117
Counting Bits (1)
Problem:
Given a stream of 0s and 1s
Be prepared to answer queries of the form
How many 1s are in the last k bits? where k ≤ N
Obvious solution:
Store the most recent N bits
When new bit comes in, discard the N+1st bit
118
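The "obvious" solution above — keep the most recent N bits and discard the oldest on each arrival — can be sketched with a bounded deque; class and variable names here are illustrative.

```python
from collections import deque

class ExactWindowCounter:
    """The 'obvious' exact solution: store the most recent N bits."""

    def __init__(self, n: int):
        self.bits = deque(maxlen=n)   # appending past n discards the oldest

    def add(self, bit: int) -> None:
        self.bits.append(bit)

    def count_ones(self, k: int) -> int:
        """Exact number of 1s among the last k bits (k <= N); if fewer
        than k bits have arrived, counts over what exists."""
        return sum(list(self.bits)[-k:])

w = ExactWindowCounter(n=8)
for b in [0, 1, 1, 0, 1, 0, 1, 1, 1, 0]:
    w.add(b)
print(w.count_ones(4))   # last 4 bits in the window are 1, 1, 1, 0 -> 3
```

This is exact but costs N bits per stream, which motivates the approximations that follow.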
Counting Bits (2)
You cannot get an exact answer without
storing the entire window
Real Problem:
What if we cannot afford to store N bits?
E.g., we’re processing 1 billion streams and
N = 1 billion
0 1 0 0 1 1 0 1 1 1 0 1 0 1 0 1 1 0 1 1 0 1 1 0
Past                      Future
A simple attempt — maintain 2 counters:
S: number of 1s from the beginning of the stream
Z: number of 0s from the beginning of the stream
How many 1s are in the last N bits? Estimate N · S/(S + Z)
But, what if the stream is non-uniform?
What if the distribution changes over time?
120
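Under a uniformity assumption one could avoid storing the window entirely, keeping only the two global counters; a common estimate in that setting is N · S/(S + Z). This sketch (function and variable names assumed) shows both the estimate and why it fails when the distribution drifts over time.

```python
def uniform_estimate(bits_so_far, n):
    """Estimate the number of 1s in the last n bits from only two
    running counters, assuming the stream is uniform over time."""
    s = sum(bits_so_far)                # S: 1s since the beginning
    z = len(bits_so_far) - s            # Z: 0s since the beginning
    return n * s / (s + z)

uniform = [0, 1] * 500                  # 1s spread evenly in time
drifted = [1] * 500 + [0] * 500         # all 1s early, none recently
print(uniform_estimate(uniform, n=100))   # 50.0 (true count is 50)
print(uniform_estimate(drifted, n=100))   # 50.0, but the true count is 0
```

The drifted stream shows the problem: the global ratio S/(S + Z) says nothing about the recent window once the distribution changes, which is exactly what DGIM addresses next.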
DGIM Method
DGIM (Datar-Gionis-Indyk-Motwani) solution
that does not assume uniformity
We store only O((log N)²) bits per stream
Solution gives approximate answer,
never off by more than 50%
Error factor can be reduced to any fraction > 0,
with more complicated algorithm and
proportionally more stored bits
121
DGIM - Overview
Problem Statement:
Given a stream of bits (0's and 1's), maintain an
approximation of the count of 1's in the last N bits
of the stream, using sub-linear space with respect
to N.
Overview:
The algorithm uses a bucket-based approach
where we group consecutive 1's into "buckets" of
different sizes.
These buckets are managed to ensure space
efficiency while providing an approximate count of
1's in the sliding window of the last N bits.
122
Key Concepts of the DGIM Algorithm:
Buckets:
Instead of storing the entire binary stream, the algorithm
groups consecutive 1's into "buckets."
Each bucket is represented by two key pieces of information:
Size: The number of 1's it contains.
Timestamp: The position of the most recent 1 in the bucket (i.e.,
the time at which the last 1 in the bucket appeared in the stream).
Exponential Bucket Sizes:
The buckets are organized in a way that their sizes are
powers of two (e.g., 1, 2, 4, 8, ...).
There can only be at most two buckets of each size at any
time. This restriction allows the algorithm to maintain a
logarithmic number of buckets.
123
Key Concepts of the DGIM Algorithm:
Sliding Window:
The algorithm operates on a sliding window of size
N, which represents the last N bits of the stream.
As new bits arrive, older bits fall out of the window.
The algorithm must update the buckets to reflect
the current window's contents.
Approximation:
The algorithm provides an approximate count of 1's
in the last N bits. The count may be off by at most
50% because of the bucket approximation. However,
this tradeoff allows for significant space savings.
124
DGIM method
Idea: Instead of summarizing fixed-length
blocks, summarize blocks with specific
number of 1s:
Let the block sizes (number of 1s) increase
exponentially
125
DGIM: Timestamps
Each bit in the stream has a timestamp,
starting 1, 2, …
Record timestamps modulo N (the window
size), so we can represent any relevant
timestamp in O(log2 N) bits
126
DGIM: Buckets
A bucket in the DGIM method is a record
consisting of:
(A) The timestamp of its end [O(log N) bits]
(B) The number of 1s between its beginning and
end [O(log log N) bits]
Constraint on buckets:
Number of 1s must be a power of 2
That explains the O(log log N) in (B) above
0101011000101101010101010101101010101010111010101011101010001011
N
127
Reasoning
To represent a bucket, we need log2 N bits to
represent the timestamp (modulo N) of its right
end.
To represent the number of 1’s we only need
log2 log2 N bits.
The reason is that we know this number i is a
power of 2, say 2^j, so we can represent i by
encoding j in binary.
Since j is at most log2 N, it requires log2 log2 N
bits.
Thus, O(log N) bits suffice to represent a bucket.
128
Representing a Stream by Buckets
129
In a nutshell
130
Example: Bucketized
Stream
Buckets, from oldest to newest: at least one of size 16
(partially beyond the window), 2 of size 8, 2 of size 4,
1 of size 2, and 2 of size 1.
1010110001011010101010101011010101010101110101010111010100010110
132
Updating Buckets (2)
If the current bit is 1:
(1) Create a new bucket of size 1, for just this bit
End timestamp = current time
(2) If there are now three buckets of size 1,
combine the oldest two into a bucket of size 2
(3) If there are now three buckets of size 2,
combine the oldest two into a bucket of size 4
(4) And so on …
133
Example: Updating
Buckets
Current state of the stream:
1010110001011010101010101011010101010101110101010111010100010110
The next bit 1 arrives and a new size-1 bucket is created; then a 0 arrives, then another 1:
1100010110101010101010110101010101011101010101110101000101100101
134
How to Query?
To estimate the number of 1s in the most
recent N bits:
1. Sum the sizes of all buckets but the last, i.e.,
the oldest one (note “size” means the number of 1s
in the bucket)
2. Add half the size of the last (oldest) bucket
135
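The bucket update rules and the query rule above can be combined into one sketch. This is a simplified illustration under assumptions of my own (timestamps are kept absolute rather than modulo N, and all names are invented), not a reference implementation.

```python
from collections import deque

class DGIM:
    """Simplified DGIM sketch for one bit stream with window size n.
    Each bucket is (end_timestamp, size); sizes are powers of 2 and
    at most two buckets of each size are kept (newest at the left)."""

    def __init__(self, n: int):
        self.n = n
        self.t = 0                       # current timestamp
        self.buckets = deque()

    def add(self, bit: int) -> None:
        self.t += 1
        # Expire the oldest bucket once its end leaves the window
        if self.buckets and self.buckets[-1][0] <= self.t - self.n:
            self.buckets.pop()
        if bit == 0:
            return
        self.buckets.appendleft((self.t, 1))   # new size-1 bucket
        size = 1
        while True:
            idx = [i for i, (_, s) in enumerate(self.buckets) if s == size]
            if len(idx) < 3:
                break
            # Merge the two *oldest* buckets of this size; the merged
            # bucket keeps the more recent of the two end timestamps.
            j, k = idx[-2], idx[-1]
            merged = (self.buckets[j][0], size * 2)
            del self.buckets[k]
            del self.buckets[j]
            self.buckets.insert(j, merged)
            size *= 2

    def count(self) -> int:
        """Approximate 1s in the last n bits: sum all bucket sizes
        except the oldest, plus half the size of the oldest bucket."""
        if not self.buckets:
            return 0
        total = sum(s for _, s in list(self.buckets)[:-1])
        return total + self.buckets[-1][1] // 2

d = DGIM(n=100)
for b in [1, 1, 1, 1]:
    d.add(b)
print(d.buckets)     # deque([(4, 1), (3, 1), (2, 2)])
print(d.count())     # 1 + 1 + 2//2 = 3 (true count is 4, within 50%)
```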
Example: Bucketized
Stream
Buckets, from oldest to newest: at least one of size 16
(partially beyond the window), 2 of size 8, 2 of size 4,
1 of size 2, and 2 of size 1.
1010110001011010101010101011010101010101110101010111010100010110
136
Counting 1’s in a
Window
[Figure: a sample bit stream with per-bit timestamps 40 through 63,
tracing the DGIM bucket updates.]
139
Error Bound: Proof
Why is error 50%? Let’s prove it!
Suppose the last (oldest) bucket has size 2^r
Then by assuming 2^(r−1) (i.e., half) of its 1s are
still within the window, we make an error of
at most 2^(r−1)
Since there is at least one bucket of each of
the sizes less than 2^r, the true sum is at least
1 + 2 + 4 + … + 2^(r−1) = 2^r − 1
Thus, the error is at most 50%. (Stream below: at least 16 1s)
11111110000000011101010101011010101010101110101010111010100010110
N
140
Further Reducing the
Error
Instead of maintaining 1 or 2 of each size
bucket, we allow either r-1 or r buckets (r > 2)
Except for the largest size buckets; we can have
any number between 1 and r of those
Error is at most O(1/r)
By picking r appropriately, we can tradeoff
between number of bits we store and the
error
141
Applications
Network Monitoring: Estimate suspicious packets (e.g., DDoS
attacks) in recent traffic.
Web Analytics: Count clicks or active users in a recent window.
Log Analysis: Monitor error/warning logs in real-time systems.
Sensor Networks: Estimate recent sensor events (e.g., motion
detection).
Financial Trading: Track recent buy/sell signals for trading
decisions.
Social Media: Monitor likes/shares on posts in real-time.
Spam Detection: Count spam messages in recent emails.
Smart Grids: Detect power outages in a sliding window for grid
management.
Video Surveillance: Track motion detection events in security
systems.
Fraud Detection: Estimate suspicious transactions in real-time
fraud systems.
142
Extensions
Can we use the same trick to answer queries
How many 1’s in the last k? where k < N?
A: Find the earliest bucket B that overlaps with k.
Number of 1s is the sum of sizes of more recent
buckets + ½ size of B
1010110001011010101010101011010101010101110101010111010100010110
k
144
Summary
Stream Computation and Model
Characteristics of Stream mining
Algorithms
Sampling
Filtering – Bloom Filter
Counting Distinct Elements - FM
Algorithm
Counting the number of 1s in the last N
elements - DGIM Algorithm
145