Bda Unit II Lecture2

This document discusses techniques for sampling data streams to obtain unbiased estimates of metrics. It describes how hashing stream elements to buckets based on their values rather than positions allows obtaining samples that accurately represent the stream metrics. Specifically, it covers sampling unique elements, controlling sample size, and sampling key-value pairs to estimate averages across keys.

Uploaded by

Anju2


More Stream Mining

Sampling Streams
Bloom Filters
Counting Distinct Items
Computing Moments

Reference: J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets,
Cambridge University Press, Second Edition, 2014.
http://www.mmds.org
Sampling a Stream
What Doesn’t Work
Sampling Based on Hash Values
When Sampling Doesn’t Work
 Suppose Google would like to examine its
stream of search queries for the past month to
find out what fraction of them were unique –
asked only once.
 But to save time, we are only going to sample
1/10th of the stream.
 The fraction of unique queries in the sample !=
the fraction for the stream as a whole.
 In fact, we can’t even adjust the sample’s fraction to
give the correct answer.
Example: Unique Search Queries
 The length of the sample is 10% of the length of
the whole stream.
 Suppose a query is unique.
 It has a 10% chance of being in the sample.
 Suppose a query occurs exactly twice in the
stream.
 It has an 18% chance (2 × 0.1 × 0.9) of appearing exactly once in
the sample, where it looks unique.
 And so on … The fraction of queries that look unique in
the sample is inflated by an amount that depends on the unknown query frequencies.
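The arithmetic above can be checked with a small sketch. It assumes a simplified stream in which every query occurs either once or twice, and it compares expected counts rather than simulating a sample; the function names and the 1000/1000 split are illustrative assumptions, not part of the slides.

```python
from fractions import Fraction


def true_unique_fraction(singles, doubles):
    """Fraction of distinct queries in the stream that occur exactly once."""
    return Fraction(singles, singles + doubles)


def apparent_unique_fraction(singles, doubles, p=Fraction(1, 10)):
    """Ratio of expected counts when each stream position is kept with probability p.

    A query that occurs once shows up in the sample with probability p.
    A query that occurs twice shows up exactly once with probability 2p(1-p)
    and twice with probability p*p.
    """
    looks_unique = singles * p + doubles * 2 * p * (1 - p)
    seen_twice = doubles * p * p
    return looks_unique / (looks_unique + seen_twice)


if __name__ == "__main__":
    print(true_unique_fraction(1000, 1000))      # fraction in the stream
    print(apparent_unique_fraction(1000, 1000))  # fraction the sample suggests
```

With equal numbers of one-time and two-time queries, the stream's unique fraction is 1/2, while the position-based sample suggests 28/29. The size of the gap depends on how many queries repeat, which is exactly what the sample was supposed to tell us.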
Sampling by Value
 My mistake: I sampled based on the position
in the stream, rather than the value of the
stream element.
 The right way: hash search queries to 10
buckets 0, 1,…, 9.
 Sample = all search queries that hash to
bucket 0.
 All or none of the instances of a query are selected.
 Therefore the fraction of unique queries in the
sample matches that of the stream as a whole, up to random variation in which queries are chosen.
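A minimal sketch of sampling by value, assuming queries are strings. The helper names `bucket_of` and `sample_by_value` are hypothetical, and SHA-256 is used only as a convenient stable hash (Python's built-in `hash()` for strings is salted per process, so it would pick a different sample on every run).

```python
import hashlib


def bucket_of(query, num_buckets=10):
    """Map a query string to a bucket in 0..num_buckets-1 using a stable hash."""
    digest = hashlib.sha256(query.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_buckets


def sample_by_value(stream, num_buckets=10, accepted=frozenset({0})):
    """Keep every occurrence of each query whose bucket is in `accepted`."""
    return [q for q in stream if bucket_of(q, num_buckets) in accepted]


if __name__ == "__main__":
    stream = ["cats", "dogs", "cats", "weather", "news", "dogs", "cats"]
    print(sample_by_value(stream))
```

Because the decision depends only on the query text, a query that occurs three times in the stream occurs either three times or zero times in the sample, so its frequency is preserved.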
Controlling the Sample Size
 Problem: What if the total sample size is
limited?
 Solution: Hash to a large number of buckets.
 Adjust the set of buckets accepted for the
sample, so your sample size stays within
bounds.

Example: Fixed Sample Size
 Suppose we start our search-query sample at
10%, but we want to limit the size.
 Hash to (say) 100 buckets, 0, 1,…, 99.
 Take for the sample those elements hashing to
buckets 0 through 9.
 If the sample gets too big, get rid of bucket 9: discard the elements already sampled from it and stop accepting new ones.
 Still too big? Get rid of bucket 8, and so on.
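The shrinking-bucket idea can be sketched as follows. This is an illustration under stated assumptions, not the slides' implementation: the class name `BoundedHashSample`, the in-memory list, and the SHA-256 hash are all hypothetical choices.

```python
import hashlib


class BoundedHashSample:
    """Keep elements hashing to buckets 0..threshold-1, lowering the threshold on overflow."""

    def __init__(self, max_size, num_buckets=100, initial_threshold=10):
        self.max_size = max_size
        self.num_buckets = num_buckets
        self.threshold = initial_threshold  # buckets below this value are accepted
        self.items = []

    def _bucket(self, element):
        digest = hashlib.sha256(element.encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") % self.num_buckets

    def add(self, element):
        if self._bucket(element) >= self.threshold:
            return  # element belongs to a bucket that is not being sampled
        self.items.append(element)
        # Too big: drop the highest accepted bucket, then the next, and so on.
        while len(self.items) > self.max_size and self.threshold > 0:
            self.threshold -= 1
            self.items = [e for e in self.items if self._bucket(e) < self.threshold]


if __name__ == "__main__":
    sample = BoundedHashSample(max_size=200)
    for i in range(5000):
        sample.add(f"query-{i}")
    print(sample.threshold, len(sample.items))
```

Note that the threshold only ever decreases, so the sample always consists of all stream elements seen so far that hash below the current threshold. If a single bucket alone exceeds the limit, the sample empties, which is one reason to hash to a large number of buckets.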

Sampling Key-Value Pairs
 This technique generalizes to any form of data
that we can see as tuples (K, V), where K is the
“key” and V is a “value.”
 Distinction: We want our sample to be based on
picking some set of keys only, not pairs.
 In the search-query example, the data was “all key.”
 Hash keys to some number of buckets.
 Sample consists of all key-value pairs with a key
that goes into one of the selected buckets.

Example: Salary Ranges
 Data = tuples of the form (EmpID, Dept, Salary).
 Query: What is the average range of salaries
within departments?
 Key = Dept.
 Value = (EmpID, Salary).
 Sample picks some departments, has salaries
for all employees of that department, including
its min and max salaries.
 Result will be an unbiased estimate of the
average salary range.
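A short sketch of the salary-range query, assuming the tuples fit the (EmpID, Dept, Salary) form and that department names are strings. The functions `in_sample` and `average_salary_range` and the sample data are hypothetical; the point is only that the keep/drop decision looks at the key (Dept) and nothing else.

```python
import hashlib


def in_sample(dept, num_buckets=10, accepted=frozenset({0})):
    """Decide from the key alone whether a department belongs to the sample."""
    digest = hashlib.sha256(dept.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_buckets in accepted


def average_salary_range(rows, keep=in_sample):
    """Average of (max salary - min salary) over the sampled departments."""
    lo_hi = {}
    for _emp_id, dept, salary in rows:
        if not keep(dept):
            continue  # the whole department is outside the sample
        lo, hi = lo_hi.get(dept, (salary, salary))
        lo_hi[dept] = (min(lo, salary), max(hi, salary))
    if not lo_hi:
        return None  # no department was sampled
    return sum(hi - lo for lo, hi in lo_hi.values()) / len(lo_hi)


if __name__ == "__main__":
    rows = [(1, "A", 50), (2, "A", 90), (3, "B", 10), (4, "B", 20)]
    # Keep every department here so the tiny example has something to average.
    print(average_salary_range(rows, keep=lambda dept: True))
```

Sampling individual tuples instead would usually miss a department's lowest or highest earner and so understate its range; sampling whole departments avoids that.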
References
 Jure Leskovec, Anand Rajaraman, Jeff Ullman,
Mining of Massive Datasets, Cambridge
University Press, Second Edition, 2014.
 http://mmds.org/
