https://chauff.github.io/documents/bdp-quiz/streaming.html
Module 4: Mining Data Streams: 23% weightage
4.1 The Stream Data Model: A Data-Stream-Management System, Examples of Stream Sources, Stream Queries, Issues in
Stream Processing.
4.2 Sampling Data techniques in a Stream
4.3 Filtering Streams: Bloom Filter with Analysis.
4.4 Counting Distinct Elements in a Stream, Count-Distinct Problem, Flajolet-Martin Algorithm, Combining Estimates, Space
Requirements
4.5 Counting Frequent Items in a Stream, Sampling Methods for Streams, Frequent Itemsets in Decaying Windows.
4.6 Counting Ones in a Window: The Cost of Exact Counts, The Datar-Gionis-Indyk-Motwani Algorithm,
Query Answering in the DGIM Algorithm, Decaying Windows.
Course Outcomes: Learner will be able to…
1. Understand the key issues in big data management and its associated applications for business decisions and strategy.
2. Develop problem-solving and critical-thinking skills in fundamental enabling techniques like Hadoop, MapReduce and NoSQL
in big data analytics.
3. Collect, manage, store, query and analyze various forms of Big Data.
4. Interpret business models and scientific computing paradigms, and apply software tools for big data analytics.
5. Adapt adequate perspectives of big data analytics in various applications like recommender systems, social media applications,
etc.
6. Solve complex real-world problems in various applications like recommender systems, social media applications, health and
medical systems, etc.
Counting Ones in Windows
• One of the important operations performed on streams is counting the number of
occurrences of a particular element in a stream.
• Consider a sliding window of length N = 6 on a single stream, as shown in Figure 1. As
the stream content changes over time, the sliding window covers new stream
elements.
Example 1: Consider Amazon online transactions. For every
product X we keep a 0/1 stream recording whether that product was
sold in the n-th transaction. A query such as "How many times
have we sold X in the last k sales?" can be answered using the
sliding-window concept.
In a sliding window, tuples are grouped within a window that slides across the
data stream according to a specified interval. A time-based sliding window with
a length of ten seconds and a sliding interval of five seconds contains tuples
that arrive within a ten-second window. The set of tuples within the window is
evaluated every five seconds. Sliding windows can contain overlapping data; an
event can belong to more than one sliding window.
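The query from Example 1 can be answered exactly if all N bits of the window are stored. The sketch below shows that baseline for a count-based window; the class name `ExactWindowCounter` and the sample bits are illustrative assumptions, not part of the slides. It needs N bits of memory, which is the cost that approximate methods try to avoid.

```python
from collections import deque


class ExactWindowCounter:
    """Exact count of 1s among the last n bits (stores all n bits)."""

    def __init__(self, n):
        self.n = n
        self.bits = deque()   # the bits currently inside the window
        self.ones = 0         # running count of 1s inside the window

    def add(self, bit):
        self.bits.append(bit)
        self.ones += bit
        # When the window overflows, evict the oldest bit and
        # subtract it from the running count.
        if len(self.bits) > self.n:
            self.ones -= self.bits.popleft()

    def count(self):
        return self.ones


# Hypothetical 0/1 sales stream for one product, window of length 6.
counter = ExactWindowCounter(n=6)
for b in [1, 0, 1, 1, 0, 0, 1, 0, 1]:
    counter.add(b)
print(counter.count())  # last six bits are 1,0,0,1,0,1 -> prints 3
```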
DGIM Algorithm
• DGIM stands for the Datar-Gionis-Indyk-Motwani algorithm.
• It is designed to estimate the number of 1s in the last N bits of a bit stream.
• The algorithm uses O(log² N) bits to represent a window of N bits
and estimates the number of 1s in the window with
an error of no more than 50%.
• Each bit in the stream has a timestamp: the position at which it
arrives.
• The first bit has timestamp 1, the second has timestamp 2, and
so on.
• Only positions within the window need to be distinguished, and the
window size N is usually taken as a multiple of two.
• Timestamps are therefore stored modulo N, so each one can be
represented with log2 N bits.
• The window is divided into buckets, each identified by the timestamp of its
right (most recent) end.
• They are called buckets because each one holds a run of consecutive bits, 0s and
1s.
• The number of 1s in a bucket must be a power of two; this number is
referred to as the size of the bucket.
• That is, the bucket sizes considered are 1, 2, 4, 8,
and so on.
• E.g., 1001011 contains four 1s, so its bucket size is 4.
Rules for forming the buckets
• The right end of a bucket is always a position with a 1.
• Every position with a 1 is in some bucket.
• No position is in more than one bucket.
• There are one or two buckets of any given size, up to some
maximum size.
• All sizes must be a power of 2.
• Buckets cannot decrease in size as we move to the left (back in
time).
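The size-related rules above can be checked mechanically. The following is a minimal sketch, assuming the bucket sizes are given as a list ordered from oldest (left) to newest (right); the function names are hypothetical. It checks only the rules about sizes, not the rules about bit positions.

```python
from collections import Counter


def is_power_of_two(x):
    return x > 0 and (x & (x - 1)) == 0


def valid_dgim_sizes(sizes_oldest_first):
    """Check the size rules for a list of bucket sizes, oldest bucket first."""
    if not sizes_oldest_first:
        return True
    # All sizes must be a power of 2.
    if not all(is_power_of_two(s) for s in sizes_oldest_first):
        return False
    # Sizes cannot decrease as we move left (back in time), i.e. they
    # must never increase as we move right toward newer buckets.
    pairs = zip(sizes_oldest_first, sizes_oldest_first[1:])
    if any(older < newer for older, newer in pairs):
        return False
    # There must be one or two buckets of every size up to the maximum.
    counts = Counter(sizes_oldest_first)
    size = 1
    while size <= max(sizes_oldest_first):
        if counts[size] not in (1, 2):
            return False
        size *= 2
    return True


print(valid_dgim_sizes([4, 4, 2, 2, 1]))  # True
print(valid_dgim_sizes([4, 2, 2, 2, 1]))  # False: three buckets of size 2
print(valid_dgim_sizes([4, 4, 1]))        # False: no bucket of size 2
```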
Example:
• The input bit stream is:
101011000101110110010110
• N= 24
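• Estimate the number of buckets and the number of 1s.
Solution:
The number of 1s in the given input stream is 13.
Split 13 into bucket sizes 4, 4, 2, 2, 1 (oldest bucket on the left):
101011 000 10111 0 11 00 101 1 0
Bucket (bits)   Size
101011          4
10111           4
11              2
101             2
1               1
The 0s between the buckets (000, 0, 00 and the final 0) belong to no bucket.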
Updating Buckets
• When a new bit arrives, drop the oldest (leftmost) bucket if its end
time is more than N time units before the current time.
• In the previous example N = 24. If one more bit arrives, the stream holds
25 bits, but the window still covers only the most recent 24.
So what has to be done now?
• New bits are always added to the right side of the previous
data.
• If the new bit is 1:
• Drop the oldest bit from the left side and add a new bucket
on the right side.
Before dropping (25 bits):
101011 000 10111 0 11 00 101 1 0 1
After dropping the leftmost bit (24 bits), regrouped into buckets:
010110001 011101 1001 011 0 1
Bucket (bits)   Size
010110001       4
011101          4
1001            2
011             2
1               1
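For comparison, here is a minimal sketch of the usual DGIM update procedure. It relies on the standard merge rule, which these slides do not spell out: whenever three buckets have the same size, the two oldest are merged into one bucket of twice the size. The class and method names are hypothetical, and timestamps are kept as plain integers rather than modulo N for readability. Because buckets are merged rather than regrouped by hand, the bucket list after a new bit may differ from a hand-drawn regrouping.

```python
class DGIM:
    """Sketch of DGIM bucket maintenance for a window of the last n bits."""

    def __init__(self, n):
        self.n = n
        self.t = 0          # timestamp of the most recent bit
        self.buckets = []   # (end_timestamp, size), newest bucket first

    def add(self, bit):
        self.t += 1
        # Drop the oldest bucket once its right end has left the window.
        if self.buckets and self.buckets[-1][0] <= self.t - self.n:
            self.buckets.pop()
        if bit != 1:
            return
        # A new 1 always starts a bucket of size 1 at the right end.
        self.buckets.insert(0, (self.t, 1))
        i = 0
        # Three buckets of one size: merge the two oldest of them, then
        # check whether the merge produced three buckets of the next size.
        while (i + 2 < len(self.buckets)
               and self.buckets[i][1] == self.buckets[i + 2][1]):
            newer, older = self.buckets[i + 1], self.buckets[i + 2]
            self.buckets[i + 1:i + 3] = [(newer[0], newer[1] + older[1])]
            i += 1

    def sizes(self):
        """Bucket sizes from oldest to newest."""
        return [size for _, size in reversed(self.buckets)]


dgim = DGIM(n=24)
for ch in "101011000101110110010110":
    dgim.add(int(ch))
print(dgim.sizes())  # [4, 4, 2, 2, 1]
```

Feeding the 24-bit example stream through this sketch yields bucket sizes 4, 4, 2, 2, 1. Under the merge rule, one further 1 simply adds a second bucket of size 1, since two buckets of the same size are allowed.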
Example 2:
• The previous input bit stream was:
• 101011000101110110010110
• Now we append a new input stream:
10101011
• If we kept N = 24, we would have to drop
many bits.
• So let us increase the value of N.
• Therefore, N = 32 now.
• After appending, the input stream becomes (the last 8 bits are the
added stream):
101011000101110110010110 10101011
• Number of 1s = 18
• The buckets are formed as:
10101100010111 01100101 101 0 101 0 1 1
Bucket (bits)     Size
10101100010111    8
01100101          4
101               2
101               2
1                 1
1                 1
Space Requirement
• 1) A single bucket is represented using O(log N) bits.
• 2) The number of buckets is O(log N).
• 3) Total space required = O(log² N) bits.
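The total follows from multiplying the two bounds. The arithmetic can be sketched as below, assuming each bucket stores its end timestamp modulo N together with the exponent of its size; this storage layout is the usual textbook one and is an assumption here, since the slides do not state it.

```latex
\begin{align*}
\text{bits per bucket} &= \underbrace{\log_2 N}_{\text{timestamp mod } N}
  + \underbrace{\log_2 \log_2 N}_{\text{exponent of the size}} = O(\log N) \\
\text{number of buckets} &\le 2\,(\log_2 N + 1) = O(\log N)
  \quad \text{(at most two buckets of each size } 1, 2, 4, \dots \le N\text{)} \\
\text{total space} &= O(\log N) \cdot O(\log N) = O(\log^2 N)
\end{align*}
```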
Querying the bucket size
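• To estimate the number of 1s in the most recent N bits, sum the
sizes of all buckets except the last (oldest) one, then add half the size of
the last bucket. Here "size" means the number of 1s in the bucket.
• Only half is added because we do not know how many 1s of the
last bucket are still within the wanted window.
• Suppose the last bucket has size 2^r. Assuming half of its 1s are in the window, the error is at most 2^(r-1). There is at least one bucket of each smaller size, so the true sum is at least 1 + 2 + 4 + ... + 2^(r-1) = 2^r - 1, plus the 1 at the right end of the last bucket. Thus the error is at most 50%.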
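The estimate can be sketched as a small function. The function name and the bucket representation, a list of (end timestamp, size) pairs with the newest bucket first, are assumptions made for illustration. The sample buckets are those of the 24-bit example stream, whose buckets end at positions 6, 14, 17, 22 and 23.

```python
def estimate_ones(buckets, now, window):
    """Estimate the number of 1s among the last `window` bits.

    buckets: list of (end_timestamp, size) pairs, newest bucket first.
    now:     timestamp of the most recent bit.
    """
    # Keep only buckets whose right end lies inside the queried window.
    live = [(ts, size) for ts, size in buckets if ts > now - window]
    if not live:
        return 0
    # All buckets but the oldest count in full; the oldest counts half,
    # because some of its 1s may lie outside the window.
    oldest_size = live[-1][1]
    return sum(size for _, size in live[:-1]) + oldest_size / 2


# Buckets of the example stream 101011000101110110010110 (N = 24).
example = [(23, 1), (22, 2), (17, 2), (14, 4), (6, 4)]
print(estimate_ones(example, now=24, window=24))  # 11.0 (true count: 13)
print(estimate_ones(example, now=24, window=10))  # 4.0 (true count: 5)
```

In both queries the estimate stays within 50% of the true count, as the bound predicts.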
Decaying Window
• The problem with a fixed-size sliding window is that it does not
take into account older elements that lie outside the
window.
• An exponentially decaying window instead takes all elements
of the stream into account, assigning different weights
to them (recent elements get more weight than older ones).
• This type of window is suitable for queries about the most
common recent elements, e.g. the most popular recent movie.
• In a data stream consisting of various elements, maintain a separate sum for each
distinct element. For every incoming element, multiply the sums of all existing
elements by (1−c), where c is a small constant. Then add the weight of the incoming
element to its corresponding sum.
• A threshold can be set so that elements whose sum falls below it are ignored.
• Finally, the element with the highest aggregate score is listed as the most popular
element.
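The update rule above can be sketched in a few lines. The function name, the sample stream and the choice c = 0.5 are illustrative assumptions; every element is given a weight of 1.

```python
def decayed_scores(stream, c, threshold=0.0):
    """Exponentially decayed score for each distinct element of a stream."""
    scores = {}
    for item in stream:
        # Decay every existing sum by a factor of (1 - c) and drop
        # sums that have fallen below the threshold.
        for key in list(scores):
            scores[key] *= (1 - c)
            if scores[key] < threshold:
                del scores[key]
        # Add the weight (1) of the incoming element to its own sum.
        scores[item] = scores.get(item, 0.0) + 1.0
    return scores


scores = decayed_scores(["a", "b", "a"], c=0.5)
# a: 1 -> 0.5 -> 0.25 + 1 = 1.25, b: 1 -> 0.5
most_popular = max(scores, key=scores.get)
print(scores, most_popular)  # {'a': 1.25, 'b': 0.5} a
```

A larger c forgets old elements faster; a smaller c makes the scores behave more like plain counts.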
Example
For example, consider the following sequence of Twitter tags:
fifa, ipl, fifa, ipl, ipl, ipl, fifa
Let each element in the sequence have a weight of 1, and let c be 0.1.
The aggregate sum of each tag at the end of the stream is calculated as
below. Each line shows the incoming tag and the updated sum.
fifa
fifa - 0 * (1-0.1) + 1 = 1 (adding 1 because the current tag is fifa)
ipl - 1 * (1-0.1) + 0 = 0.9 (adding 0 because the current tag is not fifa)
fifa - 0.9 * (1-0.1) + 1 = 1.81
ipl - 1.81 * (1-0.1) + 0 = 1.629
ipl - 1.629 * (1-0.1) + 0 = 1.4661
ipl - 1.4661 * (1-0.1) + 0 = 1.3195
fifa - 1.3195 * (1-0.1) + 1 = 2.1875
ipl
fifa - 0 * (1-0.1) + 0 = 0
ipl - 0 * (1-0.1) + 1 = 1
fifa - 1 * (1-0.1) + 0 = 0.9 (adding 0 because the current tag is not ipl)
ipl - 0.9 * (1-0.1) + 1 = 1.81
ipl - 1.81 * (1-0.1) + 1 = 2.629
ipl - 2.629 * (1-0.1) + 1 = 3.3661
fifa - 3.3661 * (1-0.1) + 0 = 3.0295
At the end of the sequence, the score of fifa is 2.1875 and the score of ipl is 3.0295.
So ipl is trending more than fifa.
The scores are not plain counts (fifa occurs 3 times and ipl 4 times): each occurrence
is discounted by a factor of (1-c) for every element that arrives after it.
Advantages of Decaying Window Algorithm:
1. Sudden spikes or spam data are taken care of.
2. Newer elements are given more weight by this mechanism, which yields the right
trending output.