Module 4

The document outlines Module 4 of a course on Mining Data Streams, covering key topics such as stream data models, sampling techniques, and algorithms for counting distinct and frequent items in data streams. It introduces the DGIM algorithm for counting ones in a sliding window and discusses the concept of decaying windows for managing data relevance over time. The course aims to equip learners with skills in big data management and analytics applicable to various real-world scenarios.

Uploaded by

Biya Rahul

https://chauff.github.io/documents/bdp-quiz/streaming.html

Module 4: Mining Data Streams: 23% weightage

4.1 The Stream Data Model: A Data-Stream-Management System, Examples of Stream Sources, Stream Queries, Issues in
Stream Processing.
4.2 Sampling Data techniques in a Stream
4.3 Filtering Streams: Bloom Filter with Analysis.
4.4 Counting Distinct Elements in a Stream, Count-Distinct Problem, Flajolet-Martin Algorithm, Combining Estimates, Space
Requirements
4.5 Counting Frequent Items in a Stream, Sampling Methods for Streams, Frequent Itemsets in Decaying Windows.
4.6 Counting Ones in a Window: The Cost of Exact Counts, The Datar-Gionis-Indyk-Motwani Algorithm,
Query Answering in the DGIM Algorithm, Decaying Windows.

Course Outcomes: Learner will be able to…

1. Understand the key issues in big data management and its associated applications for business decisions and strategy.
2. Develop problem-solving and critical-thinking skills in fundamental enabling techniques like Hadoop, MapReduce and NoSQL
in big data analytics.
3. Collect, manage, store, query and analyze various forms of Big Data.
4. Interpret business models and scientific computing paradigms, and apply software tools for big data analytics.
5. Adapt adequate perspectives of big data analytics in various applications like recommender systems, social media applications
etc.
6. Solve complex real-world problems in various applications like recommender systems, social media applications, health and
medical systems, etc.
Counting Ones in Windows
• One of the important operations performed on streams is counting the number of
occurrences of a particular element in a stream.
• Consider a sliding window of length N=6 on a single stream, as shown in figure 1. As
the stream content varies over time, the sliding window highlights new stream
elements.

Example 1: Consider Amazon online transactions. For every
product X we keep a 0/1 stream recording whether that product was
sold in the n-th transaction. A query such as “How many times
have we sold X in the last k sales?” can be answered
using the sliding window concept.
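
For comparison with the approximate method discussed later, the query in Example 1 can be answered exactly by storing the last k bits. The following is a minimal sketch, not part of the original slides; the class name `ExactWindowCounter` is hypothetical, and k is assumed to be at least 1.

```python
from collections import deque


class ExactWindowCounter:
    """Exact count of 1s among the last k bits (stores all k bits)."""

    def __init__(self, k):
        self.window = deque(maxlen=k)  # oldest bit sits on the left
        self.ones = 0

    def add(self, bit):
        # When the window is full, the append below evicts the oldest bit,
        # so remove that bit's contribution to the count first.
        if len(self.window) == self.window.maxlen:
            self.ones -= self.window[0]
        self.window.append(bit)
        self.ones += bit

    def count(self):
        return self.ones


def count_last_k(bits, k):
    """Feed a string of '0'/'1' characters and return the final window count."""
    counter = ExactWindowCounter(k)
    for ch in bits:
        counter.add(int(ch))
    return counter.count()


# 1 = product X was sold in that transaction
print(count_last_k("101011000101110110010110", k=6))
```

The cost of this approach is k bits of memory per stream, which becomes expensive when k is large or when there are many streams.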
Counting Ones in Windows

In a sliding window, tuples are grouped within a window that slides across the
data stream according to a specified interval. A time-based sliding window with
a length of ten seconds and a sliding interval of five seconds contains the tuples
that arrive within a ten-second span. The set of tuples within the window is
evaluated every five seconds. Sliding windows can contain overlapping data; an
event can belong to more than one sliding window.
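
To make the overlap concrete, the sketch below lists the windows that contain a given event time for the ten-second length and five-second interval described above. It assumes integer timestamps in seconds and windows of the form [start, start + length) that open at 0, 5, 10, and so on; the function name `windows_containing` is hypothetical.

```python
def windows_containing(t, length=10, slide=5):
    """Start times of all windows [start, start + length) that contain time t.

    Assumption: windows open at 0, slide, 2 * slide, ... and t is a
    non-negative integer number of seconds.
    """
    starts = []
    start = (t // slide) * slide  # latest window that has already opened
    while start >= 0 and start > t - length:
        starts.append(start)
        start -= slide
    return sorted(starts)


print(windows_containing(12))  # t=12 falls in the windows opening at 5 and 10
print(windows_containing(3))   # t=3 falls only in the first window
```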
DGIM Algorithm

• DGIM is the Datar-Gionis-Indyk-Motwani algorithm.

• It is designed to estimate the number of 1s in the last N bits of a bit stream.
• The algorithm uses O(log²N) bits to represent a window of N bits
and estimates the number of 1s in the window with
an error of no more than 50%.
DGIM Algorithm
• Each bit in the stream has a timestamp: the position at which it
arrives.
• The first bit has timestamp 1, the second has timestamp 2, and
so on.
• Positions are identified relative to the window size N, and the
window size is usually taken as a multiple of two.
• Timestamps are stored modulo N, so each one can be represented
with log₂N bits.
DGIM Algorithm
• The window is divided into buckets, each identified by the timestamp of its
right end.
• They are called buckets because each one holds a run of bits (0s and
1s).
• The number of 1s in a bucket must be a power of two; this number is
referred to as the size of the bucket.
• That is, the bucket sizes considered are 1, 2, 4, 8
and so on.
• E.g.: 1001011 -> bucket size = 4.
Rules for forming the buckets
• The right end of a bucket is always a position with a 1.
• Every position with a 1 is in some bucket.
• No position is in more than one bucket.
• There are one or two buckets of any given size, up to some
maximum size.
• All sizes must be a power of 2.
• Buckets cannot decrease in size as we move to the left (back in
time).
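
The size-related rules above can be checked mechanically. The sketch below is a hedged illustration rather than part of the slides: the function name `is_valid_dgim_sizes` is hypothetical, it takes only the bucket sizes (listed from oldest to newest), and it reads "one or two buckets of any given size, up to some maximum size" as requiring every power of two from 1 up to the largest size to appear once or twice.

```python
from collections import Counter


def is_valid_dgim_sizes(sizes_oldest_first):
    """Check the size rules for a list of bucket sizes, oldest bucket first."""
    sizes = list(sizes_oldest_first)
    if not sizes:
        return True
    # Every size must be a power of 2 (1, 2, 4, 8, ...).
    if any(s < 1 or s & (s - 1) for s in sizes):
        return False
    # Sizes must not decrease as we move back in time, so from oldest
    # to newest they must never increase.
    if any(older < newer for older, newer in zip(sizes, sizes[1:])):
        return False
    # Each size from 1 up to the maximum must appear once or twice.
    counts = Counter(sizes)
    size = 1
    while size <= max(sizes):
        if counts[size] not in (1, 2):
            return False
        size *= 2
    return True


print(is_valid_dgim_sizes([4, 4, 2, 2, 1]))  # satisfies all three checks
print(is_valid_dgim_sizes([4, 4, 4, 2, 1]))  # three buckets of size 4
```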
Example:
• The input bit stream is:
101011000101110110010110
• N= 24
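• Determine the buckets and the number of 1s.
Solution:
The number of 1s in the given input stream is 13.
Split 13 into 4, 4, 2, 2, 1 and group the bits accordingly, with the oldest bucket on the left:
101011 000 10111 0 11 00 101 1 0
101011 -> bucket size = 4
10111 -> bucket size = 4
11 -> bucket size = 2
101 -> bucket size = 2
1 -> bucket size = 1
The runs of 0s between the buckets (000, 0, 00 and the final 0) belong to no bucket.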
Updating Buckets
• When a new bit comes in, drop the oldest (leftmost) bucket if its end
time is prior to N time units before the current time.
• In the previous example N=24; adding one more bit makes the stream
25 bits long, one more than the window can hold.
So what has to be done now?
• New bits are always added to the right side of the previous
data.
• If the bit to be added is a 1, then:
• Drop one bit from the left side and add a new bucket of size 1
on the right side.
Stream with the new bit appended on the right:
101011 000 10111 0 11 00 101 1 0 1

Window after dropping the leftmost bit and regrouping:
010110001 011101 1001 011 0 1
010110001 -> bucket size = 4
011101 -> bucket size = 4
1001 -> bucket size = 2
011 -> bucket size = 2
1 -> bucket size = 1
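
The slides illustrate the update by regrouping the bit string. In the usual formulation of DGIM the bits themselves are not stored: each arriving 1 creates a bucket of size 1, and whenever three buckets of the same size exist, the two oldest are merged into one bucket of twice the size. The sketch below assumes that merge rule, which the slides do not spell out; the class name `DGIMBuckets` is hypothetical, and timestamps are kept as plain integers rather than modulo N for readability.

```python
class DGIMBuckets:
    """Bucket maintenance for a window of the last n bits (sketch)."""

    def __init__(self, n):
        self.n = n
        self.t = 0
        self.buckets = []  # newest first; each entry is [right_end_time, size]

    def add(self, bit):
        self.t += 1
        # Drop the oldest bucket once its right end leaves the window.
        if self.buckets and self.buckets[-1][0] <= self.t - self.n:
            self.buckets.pop()
        if bit != 1:
            return
        self.buckets.insert(0, [self.t, 1])
        i = 0
        # Three buckets of equal size at i, i+1, i+2: merge the two oldest
        # (i+1 and i+2), keep the newer right end, then check the next size.
        while i + 2 < len(self.buckets) and self.buckets[i][1] == self.buckets[i + 2][1]:
            merged = [self.buckets[i + 1][0], self.buckets[i + 1][1] * 2]
            self.buckets[i + 1:i + 3] = [merged]
            i += 1

    def sizes_oldest_first(self):
        return [size for _, size in reversed(self.buckets)]


def bucket_sizes(bits, n):
    """Feed a string of '0'/'1' characters and return sizes, oldest first."""
    d = DGIMBuckets(n)
    for ch in bits:
        d.add(int(ch))
    return d.sizes_oldest_first()


print(bucket_sizes("101011000101110110010110", n=24))
```

For the 24-bit stream used above, this procedure ends with bucket sizes 4, 4, 2, 2, 1 from oldest to newest.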
Example 2:
• The previous input bit stream was:
• 101011000101110110010110
• Now we add the new input stream
10101011
• With N=24 we would have to drop
many bits.
• So let us increase the value of N.
• Therefore, N=32 now.
• After appending the new bits, the input stream becomes
(the last 8 bits are the added stream):
101011000101110110010110 10101011

• Number of 1s = 18
• The buckets are formed as:
10101100010111 01100101 101 0 101 0 1 1
10101100010111 -> bucket size = 8
01100101 -> bucket size = 4
101 -> bucket size = 2
101 -> bucket size = 2
1 -> bucket size = 1
1 -> bucket size = 1
Space Requirement
• 1) A single bucket is represented using O(log N) bits.
• 2) The number of buckets is O(log N).
• 3) Total space required = O(log²N).
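
As a rough arithmetic check of these bounds, the sketch below counts bits under some stated assumptions that are not in the slides: each bucket stores its timestamp modulo N plus the exponent of its size, and there are at most two buckets of each size from 1 up to the largest power of two not exceeding N. The function name `dgim_space_bits` is hypothetical.

```python
def dgim_space_bits(n):
    """Upper estimate of the bits DGIM needs for a window of n bits (n >= 2)."""
    timestamp_bits = (n - 1).bit_length()            # values 0 .. n-1
    max_exponent = n.bit_length() - 1                # largest j with 2**j <= n
    exponent_bits = max(1, max_exponent.bit_length())  # values 0 .. max_exponent
    max_buckets = 2 * (max_exponent + 1)             # at most two per size
    return max_buckets * (timestamp_bits + exponent_bits)


for n in (1024, 2**30):
    print(n, dgim_space_bits(n))
```

Under these assumptions a window of 2^30 bits needs on the order of a couple of thousand bits, compared with 2^30 bits for an exact count.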
Querying the bucket size
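• To estimate the number of 1s in the most recent N bits, sum the
sizes of all buckets but the last (oldest) one (note “size” means the number of 1s in
the bucket), then add half the size of the last bucket.
• Remember that we do not know how many 1s of the
last bucket are still within the wanted window.

Suppose the last bucket has size $2^r$. By assuming that half of its 1s, that is $2^{r-1}$, are still within the window, we make an error of at most $2^{r-1}$. Since there is at least one bucket of each size smaller than $2^r$, the true sum is at least $1 + 2 + 4 + \dots + 2^{r-1} = 2^r - 1$. Thus, the error is at most 50%.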
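
The query rule can be written in a few lines. This is a minimal sketch that takes only the bucket sizes, listed from oldest to newest; the function name `estimate_ones` is hypothetical.

```python
def estimate_ones(sizes_oldest_first):
    """DGIM estimate: half of the oldest bucket plus all newer buckets."""
    if not sizes_oldest_first:
        return 0.0
    oldest, *newer = sizes_oldest_first
    return oldest / 2 + sum(newer)


print(estimate_ones([4, 4, 2, 2, 1]))  # 4/2 + 4 + 2 + 2 + 1
```

For bucket sizes 4, 4, 2, 2, 1 the estimate is 11. The true count lies between 10 (only the rightmost 1 of the oldest bucket is still in the window) and 13 (all of its 1s are), so the estimate is well within the 50% bound.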
Decaying Window
• The problem with a sliding window of fixed size is that it does not
take into account the older elements that fall outside the
window.
• An exponentially decaying window instead takes into
account all of the elements in the stream, assigning
different weights to them (recent elements get more
weight than older ones).
• This type of window is suitable for answering queries about the most
common recent elements, e.g. the most popular recent movie.
• In a data stream consisting of various elements, you maintain a separate sum for each
distinct element. For every incoming element, you multiply the sums of all the existing
elements by (1−c), where c is a small constant. You then add the weight of the incoming element to its
corresponding aggregate sum.
• A threshold can be kept so that elements whose weight falls below it are ignored.
• Finally, the element with the highest aggregate score is listed as the most popular
element.
Example
For example, consider the sequence of Twitter tags below:
fifa, ipl, fifa, ipl, ipl, ipl, fifa
Also, let's say each element in the sequence has a weight of 1.
Let c be 0.1.
The aggregate sum of each tag at the end of the above stream is calculated as
below:
fifa
fifa - 0 * (1-0.1) + 1 = 1 (adding 1 because the current tag is fifa)
ipl - 1 * (1-0.1) + 0 = 0.9 (adding 0 because the current tag is different from fifa)
fifa - 0.9 * (1-0.1) + 1 = 1.81
ipl - 1.81 * (1-0.1) + 0 = 1.629
ipl - 1.629 * (1-0.1) + 0 = 1.4661
ipl - 1.4661 * (1-0.1) + 0 = 1.3195
fifa - 1.3195 * (1-0.1) + 1 = 2.1875

ipl
fifa - 0 * (1-0.1) + 0 = 0
ipl - 0 * (1-0.1) + 1 = 1
fifa - 1 * (1-0.1) + 0 = 0.9 (adding 0 because the current tag is different from ipl)
ipl - 0.9 * (1-0.1) + 1 = 1.81
ipl - 1.81 * (1-0.1) + 1 = 2.629
ipl - 2.629 * (1-0.1) + 1 = 3.3661
fifa - 3.3661 * (1-0.1) + 0 = 3.0295

At the end of the sequence, the score of fifa is 2.1875 but the score of ipl is 3.0295.
So, ipl is trending more than fifa.
The scores depend on how recently each tag occurred as well as how often (fifa appears 3 times, ipl 4 times),
so they differ from the raw counts.
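
The update rule can also be run mechanically. The sketch below is a hedged illustration, not the slides' own code: the function name `decaying_scores` and its parameters are hypothetical, the same constant c = 0.1 is applied at every step, and a tag that has not been seen yet starts from a score of 0.

```python
def decaying_scores(stream, c=0.1, weight=1.0, threshold=None):
    """Exponentially decaying score per distinct element of the stream."""
    scores = {}
    for tag in stream:
        # Decay every existing score by the factor (1 - c).
        for key in scores:
            scores[key] *= (1 - c)
        # Add the weight of the incoming element to its own score.
        scores[tag] = scores.get(tag, 0.0) + weight
        # Optionally forget elements whose score fell below the threshold.
        if threshold is not None:
            scores = {k: v for k, v in scores.items() if v >= threshold}
    return scores


tags = ["fifa", "ipl", "fifa", "ipl", "ipl", "ipl", "fifa"]
scores = decaying_scores(tags, c=0.1)
for tag, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(tag, round(score, 4))
```

With these assumptions ipl ends with the higher score, because its occurrences are both more numerous and clustered closer to the end of the stream.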

Advantages of Decaying Window Algorithm:

1. Sudden spikes or spam data are taken care of.
2. Newer elements are given more weight by this mechanism, which helps produce the right
trending output.
