Big Data Analytics 2023

Module 3A – Mining Big Data Streams
Overview
 The Stream Data Model:
 A Data-Stream-Management System
 Examples of Stream Sources
 Stream Queries
 Issues in Stream Processing
 Sampling Data in a Stream:
 Sampling Techniques
 Filtering Streams:
 The Bloom Filter
 Counting Distinct Elements in a Stream:
 The Count-Distinct Problem
 The Flajolet-Martin Algorithm
 Counting Ones in a Window:
 The Cost of Exact Counts
 The Datar-Gionis-Indyk-Motwani Algorithm
2
Motivating Examples

3
4
Achieve real-time customer intelligence
• For many enterprises, high-performance clickstream processing is a vital business function.
• From websites to mobile devices, we need to capture and immediately process customer interactions.
• The results feed real-time business intelligence, reporting, personalization, and dynamic pricing.
5
Process events from IoT devices

• Smart factories, connected cars, smart cities, and other IoT devices generate large volumes of data continuously.
• A stream processing solution captures these IoT data streams and processes them in a centralized way.
6
Immediately detect & prevent suspicious activity
• In any type of financial transaction fraud or ad fraud, streaming is necessary to detect and automatically react to suspicious activity in real time.
• We can combine state-of-the-art machine learning anomaly detection algorithms with high-performance stream processing engines to prevent fraud.
7
Data Streams (Streaming Data)
 Streaming data refers to data that is continuously generated, usually in high volumes and at high velocity.
 A streaming data source would typically consist of a stream of logs that record events as they happen:
 a user clicking on a link in a web page
 a sensor reporting the current temperature
 Common examples of streaming data include:
 IoT sensors
 Server and security logs
 Real-time advertising
 Click-stream data from apps and websites

A data stream is a constant flow of data, which updates with high frequency and loses its relevance in a short time.
8
Streams – A New Model
 Traditional DBMS: data stored in finite, persistent data sets
 Data Streams: distributed, continuous, unbounded, rapid, time-varying, noisy, . . .
 Data-Stream Management: variety of modern applications
– Network monitoring and traffic engineering
– Sensor networks
– Telecom call-detail records
– Network security
– Financial applications
– Manufacturing processes
– Web logs and clickstreams
– Other massive data sets…
9
Typical Applications
 Heavy machinery/transportation/fleet operations:
sourcing data streams from sensors and IoT devices;
 Healthcare: real-time monitoring of health-conditions,
clinical risk-assessment, client-state analysis, and alerts;
 Finance: transaction processing, market/currency state
monitoring;
 Retail/customer service: customer behavior analysis
and operations improvement;
 Manufacturing/supply chain: real-time monitoring,
predictive maintenance, disruption/risk assessment;
 Home security: IoT data stream analysis, smart
protection, and alert systems improvement;
 Security: CCTV footage
10
11
Uber: Chaperone Tool (Kafka)
 An international ride-hailing and food-delivery service.
 Two real-time use cases:
 Tracking location of drivers and clients requires constant
data flow and updates of geolocation, pushing this data
to both types of application users. This means that Uber
has to deal with petabytes of messages to keep track of
data flow.
 Constant financial flow coming from Uber users that
make payments directly through the application requires
monitoring. Financial operations mean there is a high
risk of fraud. So, in addition to the amount of controlling
streamed data, Uber also has to be on the alert with
fraud detection.
12
Human resources
 By applying streaming analytics to data streams,
such as email, time reporting apps, injury
reports, and other resources, managers can gain
deeper insights into behavioral patterns that may
suggest an employee is burning out due to
excessive hours or actively interviewing at other
companies.
 The insights can help HR professionals and line-
of-business managers proactively balance
workloads, offer more competitive
compensation, or provide training and
development to retain valued team members.
13
DBMS vs. DSMS #1

[Figure: a DBMS answers one-time SQL queries over data stored on disk and in main memory, while a DSMS runs continuous queries (CQ) directly over incoming data stream(s) in main memory.]

14
DBMS vs. DSMS #2
Traditional DBMS:
 stored sets of relatively static records with no pre-defined notion of time
 good for applications that require persistent data storage and complex querying
DSMS:
 supports on-line analysis of rapidly changing data streams
 data stream: a real-time, continuous, ordered (implicitly by arrival time or explicitly by timestamp) sequence of items, too large to store entirely, never ending
 continuous queries
15
DBMS vs. DSMS #3
 Persistent relations (relatively static, stored) vs. transient streams (on-line analysis)
 One-time queries vs. continuous queries
 Random access vs. sequential access
 "Unbounded" disk store vs. bounded main memory
 Only current state matters vs. historical data is important
 No real-time services vs. real-time requirements
 Relatively low update rate vs. multi-GB arrival rate
 Data at any granularity vs. data at fine granularity
 Assume precise data vs. stale/imprecise data
 Access plan determined by query processor and physical DB design vs. unpredictable/variable data arrival and characteristics

Adapted from [Motwani: PODS tutorial]
16
The Stream Model - 1
 The data model and query semantics must
allow order-based and time-based operations
 The inability to store a complete stream
indicates that some approximate summary
structures must be used. As a result, queries
over the summaries may not return exact
answers.
 Streaming query plans may not use any
blocking operators that require the entire
input before any results are produced.
17
The Stream Model - 2
 Storage and performance constraints make
backtracking over a data stream infeasible.
 On-line stream algorithms are restricted to
making only one pass over the data.
 Applications that monitor streams in real-
time must react quickly to unusual data
values.
 Parallel and shared execution of many
continuous queries must be possible.
18
Generic DSMS Architecture

[Figure: streaming inputs pass through an input monitor into working storage, summary storage, and static storage (which also receives updates to static data); a query processor evaluates user queries held in a query repository against these stores and emits streaming outputs through an output buffer.]
19
General Stream Processing Model

[Figure: several streams enter a processor, e.g. . . . 1, 5, 2, 7, 0, 9, 3 / . . . a, r, v, t, y, h, b / . . . 0, 0, 1, 0, 1, 1, 0 (ordered in time). Each stream is composed of elements/tuples. The processor answers both standing queries and ad-hoc queries, producing an output stream, and has only limited working storage plus slower archival storage.]
20
Data Streams Mining
 Definition
 Continuous, unbounded, rapid, time-varying
streams of data elements
 Application Characteristics
 Massive volumes of data (can be several
terabytes)
 Records arrive at a rapid rate
 Goal
 Mine patterns, process queries and compute
statistics on data streams in real-time
21
Streaming Analytics
 Big data streaming is a process in which large streams of real-time data are processed with the sole aim of extracting insights and useful trends from them.
 A continuous stream of unstructured data is sent into memory for analysis before being stored on disk.
 This happens across a cluster of servers.
 Speed matters the most in big data streaming. The value
of data, if not processed quickly, decreases with time.
 Real-time streaming data analysis is a single-pass
analysis. Analysts cannot choose to reanalyze the data
once it is streamed.

22
Data stream Mining – Challenges
 Mining big data streams - In addition to Volume,
Velocity, Variety also Volatility.
 Volatility - dynamic environment with ever-changing
patterns.
 Concept drift is a phenomenon that occurs when
the distributions of features x and target variables y
change in time.
 As data streams have no beginning or end, they can’t
be broken into batches. So there is no time when the
data can be uploaded into storage and processed.
Instead, data streams are processed on the fly.
23
Challenges
 Single pass: each record is examined at most once
(random access not possible)
 Bounded storage: limited memory for storing synopsis
 Real-time: per record processing time (to maintain
synopsis) must be low
 As new data arrives older data discarded to make room for
subsequent examples.
 The algorithm processing the stream has no control over the
order of the examples seen, and must update its model
incrementally as each example is inspected.
 An additional desirable property, the so-called anytime
property, requires that the model is ready to be applied at any
point between training examples
 Generally, algorithms compute approximate answers
24
Applications (1)
 Mining query streams
 Google wants to know what queries are more frequent today than yesterday
 Mining click streams
 Yahoo wants to know which of its pages are getting an unusual number of hits in the past hour
 Mining social network news feeds
 E.g., look for trending topics on Twitter, Facebook
25
Applications (2)
 Sensor Networks
 Many sensors feeding into a central controller
 Telephone call records
 Data feeds into customer bills as well as
settlements between telephone companies
 IP packets monitored at a switch
 Gather information for optimal routing
 Detect denial-of-service attacks

26
Interesting Use Cases
Customer experience and interaction
 When a customer visits a retail website, their
website movements get tracked, as well as
their purchasing preferences and choices.
 Recognizing and responding to consumer
buying patterns in real time can be integrated
into marketing so a customer looking at slacks
could suddenly see an offer for matching
shirts.

27
Interesting Use Cases
Environmental sensing
 Ambient temperatures in data centers, conference
rooms, offices, warehouses, refrigeration units,
hospital operating rooms, etc., are all part of
providing viable work space.
 If a sensor in a room suddenly detects temperatures
or humidities that are falling outside of range, an
auto-alert can be sent, and a maintenance person
can be dispatched.
 The technology can save lives, prevent food spoilage
, and keep data centers running.
28
Interesting Use Cases
Systems geo-tracking and security
 Who's tapping into your systems and
networks, when and from where are all
important elements of security and
governance.
 So is the ability to track foot traffic through
plants and offices.
 In 2020, IBM reported that the cost of a single data breach was $3.8 million, so the savings in cost and company reputation are significant.
29
Interesting Use Cases
Industrial IoT
 A machine failure on an assembly line can cost $1 million per day.
 That failure can be prevented by an industrial sensor that can detect a
machine failing in real time before it happens.
 This kind of preemptive maintenance keeps assembly lines running and
saves millions of dollars.

Logistics

 Logistics companies track trucks and cars on the road with IoT sensors.
 They are able to see which vehicles will arrive on time, or ahead of, or
behind schedule.
 They can observe vehicle proximity and reroute a vehicle if another vehicle
in the area suffers a breakdown.
 All of this is facilitated with IoT devices and sensors attached to vehicles
that are monitored in real time.
 The savings can mount up. For refrigerated trucks alone, the
late fee for one load of cargo can be $500.
30
Interesting Use Cases
Patient monitoring
 Healthcare clinics and hospitals can now
automatically receive vitals readings from patients
who are at home.
 Alerts are issued if a patient's data indicates a
dangerous condition.
Fraud detection
 In a flash, a bank card processor can detect a
fraudulent credit card transaction as soon as the
perpetrator passes the card through a card reader.
 The transaction gets denied, and no money is lost.

31
Overview Data Stream Processing

32
Data Stream Queries – Types

Answer availability:
 One-time
 Multiple-time
 Continuous ("standing"), stored or streamed
 Join queries

Registration time:
 Predefined
 Ad hoc
33
Stream Queries Issues - 1
 Unbounded memory requirements
 Approximate query answering
 data reduction and synopsis construction
 sketches, random sampling, histograms & wavelets
 Sliding windows: storing only the most recent elements, e.g.:
 SELECT AVG(S.minutes)
 FROM Calls S [PARTITION BY S.customer_id
 ROWS 10 PRECEDING]
 WHERE S.type = 'Long Distance'

34
Stream Queries Issues - 2
 Join Queries
 Return the average length of the last 1000 telephone calls placed by "Gold" customers:
 SELECT AVG(V.minutes)
 FROM (SELECT S.minutes
 FROM Calls S, Customers T
 WHERE S.customer_id = T.customer_id
 AND T.tier = 'Gold')
 V [ROWS 1000 PRECEDING]
 Notice that in this example, the stream of calls must be joined to the Customers relation before applying the sliding window.
35
Stream Queries Issues - 3
 Batch Processing, Sampling, and Synopses
 In batch processing, rather than producing a continually
up-to-date answer, the data elements are buffered as
they arrive, and the answer to the query is computed
periodically.
 In sampling some data points must be skipped
altogether, so that the query is evaluated over a sample
of the data stream rather than over the entire data
stream.
 can often design an approximate data structure that
maintains a small synopsis or sketch of the data rather
than an exact representation, and therefore is able to
keep computation per data element to a minimum.
36
Stream Queries Issues - 4
 Blocking Query Operator
 is a query operator that is unable to produce
an answer until it has seen its entire input.
Sorting is an example of a blocking operator,
as are aggregation operators such as SUM,
COUNT, MIN, MAX, and AVG.
 Dealing with them effectively is one of the
challenges of data stream computation.

37
Stream Query Processing Issues

38
Data Streams Operations
 Sampling data from a stream
 Filtering a data stream
 Select elements with property x from the stream
 Counting distinct elements
 Number of distinct elements in the last k elements of the stream
 Counting the number of 1s in a window
39
Sampling
 Sampling is a common practice for
selecting a subset of data to be analyzed.
 We select instances at periodic intervals.
 Sampling is used to compute statistics
(expected values) of the stream.
 The main problem is to obtain a
representative sample, a subset of data
that has approximately the same
properties of the original data.
40
Sampling from a Data Stream
 Since we cannot store the entire stream, one obvious approach is to store a sample
 Two different problems:
 (1) Sample a fixed proportion of elements in the stream (say 1 in 10)
 (2) Maintain a random sample of fixed size over a potentially infinite stream
 At any "time" k we would like a random sample of s elements
 What is the property of the sample we want to maintain? For all time steps k, each of the k elements seen so far has equal probability of being sampled

41
Maintaining a fixed-size sample
 Suppose we need to maintain a random sample S of size exactly s tuples
 E.g., a main-memory size constraint
 Why is this hard? We don't know the length of the stream in advance
 Suppose at time n we have seen n items
 Each item should be in the sample S with equal probability s/n

How to think about the problem: say s = 2
Stream: a x c y z k c d e g …
At n = 5, each of the first 5 tuples is included in the sample S with equal probability
At n = 7, each of the first 7 tuples is included in the sample S with equal probability
An impractical solution would be to store all n tuples seen so far and pick s of them at random
42
Solution: Fixed-Size Sample
 Reservoir Sampling
 Store the first s elements of the stream in S
 Suppose we have seen n-1 elements, and now the nth element arrives (n > s)
 With probability s/n, keep the nth element, else discard it
 If we picked the nth element, it replaces one of the s elements in the sample S, picked uniformly at random
 Claim: This algorithm maintains a sample S with the desired property:
 After n elements, the sample contains each element seen so far with probability s/n
43
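A minimal sketch of this procedure in Python (the function name and the use of the standard random module are illustrative assumptions, not part of the slides):

    import random

    def reservoir_sample(stream, s):
        """Maintain a uniform random sample of exactly s stream elements."""
        sample = []
        for n, element in enumerate(stream, start=1):
            if n <= s:
                sample.append(element)                 # fill the reservoir first
            elif random.random() < s / n:              # keep with probability s/n
                sample[random.randrange(s)] = element  # replace one uniformly
        return sample

For example, reservoir_sample(range(1000000), 10) returns 10 elements, each of which was retained with probability s/n at every prefix length n.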
Reservoir Sampling

44
Decision tree for the reservoir sampling algorithm on the stream a1, a2, a3, a4.
 The algorithm randomly decides whether ("yes") or not ("no") to take the considered input item as the actual sample s.
 Each node is labeled by the probability of the previous decision, and its color indicates the item currently chosen as s.
 The probability of a specific path through this tree results from multiplying the probabilities along the path.

45
46
Reservoir Sampling - Issues

47
Reservoir Sampling - Issues

48
Possible Solutions

49
Biased Reservoir Sampling

50
Biased Reservoir Sampling
 A bias function to regulate the sampling from
the stream.
 This bias gives a higher probability of
selecting data points from recent parts of the
stream as compared to distant past.
 This bias function is quite effective since it
regulates the sampling in a smooth way so
that queries over recent horizons are more
accurately resolved.

51
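One simple way to realize such a bias is a memoryless scheme in the spirit of Aggarwal's biased reservoir sampling; the sketch below assumes an exponential bias with rate λ = 1/capacity and should be read as an illustration, not as the definitive method:

    import random

    def biased_reservoir(stream, capacity):
        """Biased sample favoring recent elements (exponential-bias sketch)."""
        reservoir = []
        for x in stream:
            fraction_filled = len(reservoir) / capacity
            if random.random() < fraction_filled:
                # Overwrite a random slot: older items decay away over time.
                reservoir[random.randrange(len(reservoir))] = x
            else:
                reservoir.append(x)   # reservoir still growing
        return reservoir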
Concise Sampling
 Note that the size of the reservoir is sometimes
restricted by the available main memory.
 The method of concise sampling exploits the fact that
the number of distinct values of an attribute is often
significantly smaller than the size of the data stream.
 The sample is maintained as a set S of <value, count>
pairs.
 For those pairs in which the value of count is one, we
do not maintain the count explicitly, but we maintain
the value as a singleton.
 The number of elements in this representation is
referred to as the footprint
52
Concise Sampling
 Duplicates in the sample S are stored as <value, count> pairs (thus potentially boosting the actual sample size)
 Add each new element to S with probability 1/T (simply increment the count if the element is already in S)
 If the sample size exceeds M:
 Select a new threshold T' > T
 Evict each element (decrementing counts) from S with probability 1 - T/T'
 Add subsequent elements to S with probability 1/T'

53
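A rough Python sketch of this threshold-raising loop (the dict-of-counts representation and the doubling rule T' = 2T are illustrative assumptions; any T' > T works):

    import random

    def concise_sample(stream, max_footprint, t=1.0):
        """Concise sampling sketch: sample maps value -> count."""
        sample = {}
        for x in stream:
            # Admit with probability 1/t; duplicates just bump the count.
            if x in sample or random.random() < 1.0 / t:
                sample[x] = sample.get(x, 0) + 1
            while len(sample) > max_footprint:
                t_new = 2 * t                       # new threshold t' > t
                for value in list(sample):
                    # Each copy survives with probability t/t'.
                    kept = sum(random.random() < t / t_new
                               for _ in range(sample[value]))
                    if kept:
                        sample[value] = kept
                    else:
                        del sample[value]
                t = t_new
        return sample, t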
Concise-Sampling Example
 Dataset
 D = { a, a, a, b, b, b }
 Footprint
 F = one <value, count> pair
 Three (possible) samples of size = 3:
 S1 = { a, a, a }, S2 = { b, b, b }, S3 = { a, a, b }
 As pairs: S1 = {<a,3>}, S2 = {<b,3>}, S3 = {<a,2>,<b,1>}
 The three samples should occur with equal likelihood
 But with a footprint of one pair, Prob(S1) = Prob(S2) > 0 while Prob(S3) = 0
 In general:
 Concise sampling under-represents "rare" population elements

54
Filtering Data
Review: Bloom Filters
 Given a set S = {x1, x2, x3, …, xn} over a universe U, we want to answer queries of the form: Is y ∈ S?
 A Bloom filter provides an answer in
 "constant" time (the time to hash)
 a small amount of space
 but with some probability of being wrong

56
Bloom Filter
 A space-efficient probabilistic data structure, conceived by Burton Howard Bloom, used to test set membership
 False positives are possible, but there are no false negatives (100% recall rate)
 A query returns "possibly in set" or "definitely not in set"
 Elements can be added to the set, but not removed
 The more elements that are added to the set, the larger the probability of false positives
 Uses considerably less space than any exact method, but pays for this by introducing a small probability of error
57
First Example Scenario
 Suppose you are creating an account on Goodreads. You want to enter a cool username; you entered it and got the message "Username is already taken".
 You added your birth date to the username, still no luck. Now you have added your university roll number as well, and still got "Username is already taken". It's really frustrating, isn't it?
 But have you ever thought about how quickly Goodreads checks the availability of a username by searching the millions of usernames registered with it?
 Linear search: bad idea!
 Binary search: way too much work (sorting, etc.)!
58
Scenario Contd.
 Bloom Filter is a data structure that can do this job.
 What is Bloom Filter?
 A Bloom filter is a space-efficient probabilistic data
structure that is used to test whether an element is a
member of a set.
 For example, checking the availability of a username is a set membership problem, where the set is the list of all registered usernames.
 The price we pay for efficiency is that the filter is probabilistic in nature, meaning there might be some false positive results.
 A false positive means the filter might say a given username is already taken when actually it is not.
59
Example - Google Chrome
 Chrome needs to store a blacklist of dangerous URLs.
 Each time a user is about to navigate to a new page, it must be checked against the blacklist.
 Size of the blacklist: around a million URLs, each between 2 and 2083 characters long.
 Thus, we may need disk accesses to determine whether the current URL is to be allowed or not.
 Considering the large number of users continuously accessing websites through Chrome, this is indeed a data stream with very high velocity.
 The algorithm will use main memory only and yet will filter out most of the undesired stream elements.
60
Motivating Example
 Let us assume we want to use about one megabyte of
available main memory to store the blacklist.
 Bloom filter uses the main memory as a bit array being able
to store eight million bits.
 A hash function “h” that maps each URL in the blacklist to
one of eight million buckets. That corresponding bit in the
bit array is set to 1. All other bits of the array remain 0.
 It is possible that two URLs could hash to the same bit.
 When a stream element arrives, we hash its URL. If the bit
to which it hashes is 1, then we need to further check
whether this URL is safe for browsing or not.
 But if the URL hashes to a 0, then definitely address is not
in the blacklist so we can ignore this stream element.
61
Simple Example

62
Bloom Filters
 Bloom filters compactly encode set membership
 k hash functions map each item to the bit vector k times
 Set all k entries to 1 to indicate the item is present
 Can look up items; a set of size n is stored in ~2n bits

[Figure: an item is hashed by three functions, each setting one bit of the vector to 1.]
63
3 Hash Function Bloom Filter

64
Bloom Filter Overview
 Both insertions and membership queries should be performed in constant time.
 A Bloom filter is a bit vector B of m bits, with k independent hash functions that map each key in U to the set {0, 1, …, m-1}.
 We assume that each hash function maps a uniformly at random chosen key to each element of {0, 1, …, m-1} with equal probability.
 Since we assume the hash functions are independent, it follows that the vector (h1(x), h2(x), …, hk(x)) is equally likely to be any of the k-tuples of elements from {0, 1, …, m-1}.
65
Algorithm
 Initially all m bits of B are set to 0.
 Insert x into S:
 Compute h1(x), h2(x), …, hk(x)
 Set B[h1(x)] = B[h2(x)] = … = B[hk(x)] = 1
 Query if x ∈ S:
 Compute h1(x), h2(x), …, hk(x)
 If B[h1(x)] = B[h2(x)] = … = B[hk(x)] = 1, then answer Yes, else answer No
66
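A minimal Python sketch of these two operations (deriving the k hash functions by salting SHA-256 is an implementation assumption; the analysis above treats the functions as truly independent):

    import hashlib

    class BloomFilter:
        """m-bit Bloom filter with k salted hash functions."""

        def __init__(self, m, k):
            self.m, self.k = m, k
            self.bits = [0] * m

        def _indexes(self, item):
            # Salt one base hash k ways to simulate k hash functions.
            for i in range(self.k):
                digest = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
                yield int(digest, 16) % self.m

        def insert(self, item):
            for idx in self._indexes(item):
                self.bits[idx] = 1

        def may_contain(self, item):
            # True  -> "possibly in set" (could be a false positive)
            # False -> "definitely not in set" (never a false negative)
            return all(self.bits[idx] for idx in self._indexes(item))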
Bloom Filters
Start with an m bit array, filled with 0s.

B 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

Hash each item xj in S k times. If Hi(xj) = a, set B[a] = 1.

B 0 1 0 0 1 0 1 0 0 1 1 1 0 1 1 0

To check if y is in S, check B at Hi(y). All k values must be 1.

B 0 1 0 0 1 0 1 0 0 1 1 1 0 1 1 0

Possible to have a false positive; all k values are 1, but y is not in S.

B 0 1 0 0 1 0 1 0 0 1 1 1 0 1 1 0

n items, m = cn bits, k hash functions


67
Tradeoffs

Three parameters.
 Size m/n : bits per item.
 Time k : number of hash functions.
 Error f : false positive probability.

68
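These three parameters are linked by the standard false-positive analysis. Assuming the k hash functions behave like independent uniform random functions, after inserting n items into m bits the false positive probability is approximately

    f ≈ (1 - e^(-kn/m))^k

which is minimized by choosing k = (m/n)·ln 2, giving f ≈ (0.6185)^(m/n). For example, m/n = 8 bits per item with k = 5 or 6 hash functions yields roughly a 2% false positive rate.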
Overview
 At the heart of every Bloom filter lie two key elements:
 An array of n bits, initially all set to 0.
 A collection of k independent hash functions h(x). Each hash function
takes a value v and generates a number i where i < n which effectively
maps to a position in the bit array.
 The underlying idea of a bloom filter is quite simple
 Initialize bit array of n bits with 0s. n >> no. of elements in the set.
 Whenever the filter sees a new element apply each of the hash
functions h(x) on the element. With the value generated, which is an
index in the bit array, set the bit to 1 in the array.
 If there are k hash functions there will be k indices generated. For each
of these k positions in the bit array set array[i] = 1
 To check if element is in the set, the same procedure with a twist.
 Generate k values by applying the k hash-functions on the input. If at
least one of these k indices in the bit array is set to zero then the
element is a new element else this is an existing element in the set.
69
70
Interesting Properties of Bloom Filters
 Unlike a standard hash table, a Bloom filter of a fixed size can
represent a set with an arbitrarily large number of elements.
 Adding an element never fails. However, the false positive
rate increases steadily as elements are added until all bits in
the filter are set to 1, at which point all queries yield a
positive result.
 Bloom filters never generate false negative result, i.e., telling
you that a username doesn’t exist when it actually exists.
 Deleting elements from the filter is not possible because, if we delete a single element by clearing the bits at the indices generated by its k hash functions, we might cause the deletion of a few other elements that share those bits.

71
Some More Examples
 The servers of Akamai Technologies, a content delivery provider,
use Bloom filters to prevent "one-hit-wonders" from being stored
in its disk caches.
 One-hit-wonders are web objects requested by users just once,
something that Akamai found applied to nearly three-quarters of
their caching infrastructure.
 Using a Bloom filter to detect the second request for a web object
and caching that object only on its second request prevents one-
hit wonders from entering the disk cache, significantly reducing
disk workload and increasing disk cache hit rates.
 Google Bigtable, Apache HBase and Apache Cassandra,
and Postgresql use Bloom filters to reduce the disk lookups for
non-existent rows or columns. Avoiding costly disk lookups
considerably increases the performance of a database query
operation
72
Some Examples
 The Squid Web Proxy Cache uses Bloom filters for cache digests.
 Bitcoin uses Bloom filters to speed up wallet synchronization.
 The Venti archival storage system uses Bloom filters to detect
previously stored data.
 The SPIN model checker uses Bloom filters to track the
reachable state space for large verification problems.
 The Cascading analytics framework uses Bloom filters to speed
up asymmetric joins, where one of the joined data sets is
significantly larger than the other (often called Bloom join in the
database literature).
 Prime Video uses Bloom filters effectively to avoid duplicate
recommendations
 Medium uses Bloom filters to avoid recommending articles a
user has previously read.
 Ethereum uses Bloom filters for quickly finding logs on the
Ethereum blockchain
73
Examples
 Joins on distributed relations
 Spell Check
 Weak Password Dictionary - Store dictionary of easily
guessable passwords as bloom filter, query when users pick
passwords.
 Virus Signature detection
 Inventory checks
 Any Unique Identification System has to generate a unique number
for newly registered users. If the number of user registrations
increase dramatically, checking with the database is too expensive.
In this case, a bloom filter can tell if a number has already been
generated or not. If yes, simply generate a new random number
and check with the filter again. Keep doing this till the bloom filter
returns false.
74
Popular hash algorithms
 DJB2
 DJB2a (variant using xor rather than +)
 FNV-1 (32-bit)
 FNV-1a (32-bit)
 SDBM
 CRC32
 Murmur2 (32-bit)
 SuperFastHash

75
Example
 In this section, we present an example using Bloom filters. We assume an array of 10 bits, all initially set to 0. Also, we assume two simple hash functions:

 1) h1(x) = x mod 10
 2) h2(x) = (5x + 4) mod 10

 Initially, we have the following (empty) Bloom filter:

Position: 0 1 2 3 4 5 6 7 8 9
Bit:      0 0 0 0 0 0 0 0 0 0
76
Insertion
 To insert 19 in bloom filter, we compute the
digests of h1, h2 :
 h1(19) = 19 mod 10 = 9
 h2(19)= (5*19+4)mod 10 = 99mod 10 = 9
 Then, the bit in 9th position of the filter is set
to 1. After insertion of 19 the filter is :

77
Insert
 Similarly, to insert 132 we compute
 h1(132) = 132 mod 10 = 2
 h2(132) = (5*132 + 4) mod 10 = 664 mod 10 = 4
 Then the bits in positions 2 and 4 are set to 1. The filter now is:

78
Insert
 Finally, regarding insertion of 25, digests of
hash functions are
 h1(25) = 5
 h2(25) = 9
 Bit 5 is set to 1. Bit 9 is already 1, as it has been
set by the insertion of element 19. Bloom
filter after the insertion of 25 is :

79
Existence Checks
 Now, we check using the previously formed Bloom filter for the
existence of the elements 133, 25 and 24 in the set A.
 To check if element 133 exists in A, we first compute the digests of
h1, h2: h1(133) = 3 h2(133) = 9
 Then we check whether the bits of positions 3 and 9 of the Bloom
filter are set to 1. Although bit 9 is set, the bit in position 3 is 0. As a
result, the filter returns NO.
 To check if element 25 exists in A, we compute h1(25) = 5 and
h2(25) = 9. Then we check whether the bits of positions 5 and 9 of
the Bloom filter are set to 1. Indeed, both bits are 1. So, Bloom filter
returns YES. It is a true positive, as 25 exists in the set.
 To check if element 24 exists in A, we compute h1(24) = 24 mod 10 = 4 and h2(24) = (5*24 + 4) mod 10 = 124 mod 10 = 4. Both digests point to position 4, so we check whether that bit is set to 1. Indeed, it is. Although the Bloom filter again returns YES, element 24 does not exist in the set (resulting in a false positive).
80
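The whole worked example fits in a few lines of Python; a quick sketch using the same two hash functions:

    bits = [0] * 10
    h1 = lambda x: x % 10
    h2 = lambda x: (5 * x + 4) % 10

    for x in (19, 132, 25):                   # insertions
        bits[h1(x)] = bits[h2(x)] = 1

    member = lambda x: bits[h1(x)] == bits[h2(x)] == 1
    print(member(133))   # False -> definitely not in the set
    print(member(25))    # True  -> true positive
    print(member(24))    # True  -> false positive: 24 was never inserted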
Example

81
Using the Bloom Filter

82
Sketches
 Not every problem can be solved with sampling
 Example: counting how many distinct items are in the stream
 If a large fraction of items aren't sampled, we don't know if they are all the same or all different
 Other techniques take advantage of the fact that the algorithm can "see" all the data even if it can't "remember" it all
 "Sketch": essentially, a linear transform of the input
 Model the stream as defining a vector; the sketch is the result of multiplying the stream vector by an (implicit) matrix (a linear projection)
83
Counting Distinct Elements in a Stream
Problem Description
 Given a data stream of n insertions of records, count the number F0 of distinct records
 One pass over the data stream
 Algorithms must use a small amount of memory and have fast update time
 it is too expensive to store the set of distinct records
 this implies algorithms must be randomized and must settle for an approximate solution: output F̂ ∈ [(1-ε)F0, (1+ε)F0] with constant probability

85
Some Applications
 How many different words are found among
the Web pages being crawled at a site?
 Unusually low or high numbers could indicate
artificial pages (spam?).
 How many different Web pages does each
customer request in a week?

86
Simple Solution
 Keep an array a[0, …, U], initially all 0.
 Also keep a counter C initialized to 0.
 Every time an item i arrives, look at a[i].
 If it is zero, increment C and set a[i] = 1.
 Return C as the number of distinct items.
 Time: O(1) per update and per query
 But space is O(U).
 What happens if we do not have enough memory to store all the distinct items? Enter the Flajolet-Martin sketch.
87
Using Small Storage
 Real problem: what if we do not have space to store the complete set?
 Estimate the count in an unbiased way.
 Accept that the count may be in error, but limit the probability that the error is large.
88
Challenge
Example (N = 64)
Data stream: 3 2 5 3 2 1 7 5 1 2 3 7 5
Number of distinct values: 5

This is a hard problem for random sampling!
 We must sample almost the entire table to guarantee the estimate is within a factor of 10 with probability > 1/2, regardless of the estimator used!
89
Vector Interpretation

Stream: 8 2 1 9 1 9 2 4 4 9 4 2 5 4 2 5 8 5 2 5

Vector x with positions 1 2 3 4 5 6 7 8 9:
 Initially, x = 0
 Insertion of i is interpreted as xi = xi + 1
 We want to estimate DE(x) = the number of non-zero elements of x
90
The Flajolet-Martin Algorithm
 The Flajolet-Martin algorithm uses the position
of the rightmost set and unset bit to
approximate the count-distinct in a given stream.
 The two seemingly unrelated concepts are
intertwined using probability.
 It uses extra storage of order O(log
m) where m is the number of unique elements in
the stream and provides a practical estimate of
the cardinalities.

91
Basic Idea of FM
 The basic idea.
 Keep an array a[1.... log U]
 Use a hash function f : {1...U} → {0.... log U}
 Compute f (i) for every item in the stream,
and set a[f (i)] = 1.
 Somehow extract from this the approximate
number of distinct items.
 Space requirement=O(log U) = O(log N),
assuming hash functions do not require too
much of space.
92
Flajolet-Martin Approach - Intuition

 Suppose we had a good, random hash function that acted on strings


and generated integers, what can we say about the generated
integers? Since they are random themselves, we would expect:
 1/2 of them to have their binary representation end in 0 (i.e. divisible by 2
),
 1/4 of them to have their binary representation end in 00 (i.e. divisible by
4)
 1/8 of them to have their binary representation end in 000 (i.e. divisible
by 8 )
 and in general, 1/2^n of them to have their binary representation end in
0^n .
 Turning the problem around, if the hash function generated an
integer ending in 0^m bits (and it also generated integers ending in
0^{m-1} bits, 0^{m-2} bits, ..., 0^1 bits), intuitively, the number of
unique strings is around 2^m .
93
Intuition
 In general, we can say that the probability of the rightmost set bit, in binary representation, being at position k in a uniform distribution of numbers is P(k) = 1 / 2^(k+1)
94
Intuition
 The probability of the rightmost set bit drops by a factor of 1/2 with
every position from the LSB to MSB
 So if we keep recording the position of the rightmost set bit, ρ, for every element in the stream (assuming a uniform distribution), we should expect P(ρ = 0) = 0.5, P(ρ = 1) = 0.25, and so on. This probability should become 0 for bit positions b > log m, while it should be non-zero for b <= log m, where m is the number of distinct elements in the stream.
 Hence, if we find the rightmost unset bit position b such that the
probability is 0, we can say that the number of unique elements will
approximately be 2 ^ b. This forms the core intuition behind the Flajolet
Martin algorithm.

95
Flajolet-Martin Approach
 Pick a hash function h that maps each of the n
elements to at least log2n bits.
 For each stream element a, let r (a ) be the
number of trailing 0’s in h (a ).
 Record R = the maximum r (a ) seen.
 Estimate = 2R.

96
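A minimal single-hash Python sketch of this approach (MD5 here is a stand-in for the "good, random" hash function assumed above; a practical estimator would combine many hash functions and apply the correction factor discussed later):

    import hashlib

    def trailing_zeros(n, width=32):
        """r(a): number of trailing 0 bits (width if the value is 0)."""
        if n == 0:
            return width
        return (n & -n).bit_length() - 1

    def fm_estimate(stream):
        """Estimate distinct count as 2^R, with R the max trailing-zero count."""
        R = 0
        for a in stream:
            h = int(hashlib.md5(str(a).encode()).hexdigest(), 16) % (1 << 32)
            R = max(R, trailing_zeros(h))
        return 2 ** R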
Simplified
 Create a bit vector (bit array) of sufficient length L, such that 2^L > n, the number of elements in the stream. Usually a 64-bit vector is sufficient, since 2^64 is quite large for most purposes.
 The i-th bit in this vector/array represents whether we have seen a hash value whose binary representation ends in 0^i. So initialize each bit to 0.
 Generate a good, random hash function that maps input (usually strings) to natural numbers.
 Read the input. For each word, hash it and determine the number of trailing zeros. If the number of trailing zeros is k, set the k-th bit in the bit vector to 1.
97
Simplified
 Once input is exhausted, get the index of
the first 0 in the bit array (call this R). By
the way, this is just the number of
consecutive 1s (i.e. we have seen 0,
00, ..., as the output of the hash
function) plus one.
 Calculate the number of unique words
as
 2^R * Constant .
98
Simple Explanation
 We start with defining a closed hash range, big enough to hold the
maximum number of unique values possible - something as big as 2
^ 64.
 Every element of the stream is passed through a hash function that
permutes the elements in a uniform distribution.
 For this hash value, we find the position of the rightmost set bit and
mark the corresponding position in the bit vector as 1, suggesting
that we have seen the position.
 Once all the elements are processed, the bit vector will have 1s at
all the positions corresponding to the position of every rightmost
set bit for all elements in the stream.
 Now we find the position, b, of the rightmost 0 in this bit vector.
This position b corresponds to the rightmost set bit that we have
not seen while processing the elements.
 This corresponds to the probability 0 and hence as per the intuition
will help in approximating the cardinality as 2 ^ b.
99
FM Algorithm
Use r hash functions to create r FM sketches:
 Initialize each FM sketch to zero
 For each record x in the dataset, and for each hash function hi(x), set FMi[pivot] = 1
 Let Bi be the position of the leftmost 0-bit of FMi
 B = (B1 + B2 + … + Br) / r
 Number of distinct elements = α * 2^B, where α = 1.2897385

Example with r = 3:
FM1 = 1 0 1 0 → B1 = 1
FM2 = 1 1 0 0 → B2 = 2
FM3 = 1 1 0 1 → B3 = 2
B = (1 + 2 + 2) / 3 = 1.67
100
101
102
Examples

x | binary format | r(x)
0 | 0000 | 4 (= L)
1 | 0001 | 0
2 | 0010 | 1
3 | 0011 | 0
4 | 0100 | 2
5 | 0101 | 0
6 | 0110 | 1
7 | 0111 | 0
8 | 1000 | 3

 X = 10 = (1010)2
 bit(y,0) = 0, bit(y,1) = 1, bit(y,2) = 0, bit(y,3) = 1

103
Flajolet-Martin Approach – Estimate Example
 Part of a Unix manual file M of 26692 lines is loaded, of which 16405 are distinct.
 Suppose the final BITMAP looks like this: 0000,0000,1100,1111,1111,1111
 The leftmost 1 appears at position 15.
 We would say there are around 2^15 distinct elements in the stream; but 2^14 = 16384.
 Estimate = 2^R / φ, where φ ≈ 0.77351 is the correction factor.
104
105
Example

106
Variations of the F-M Algorithm
 Take the mean of the k results from the different hash functions, obtaining a single estimate of the cardinality.
 A different idea is to use the median, which is less prone to being influenced by outliers.
 A problem with the median is that the result can only take the form of some power of 2.
 A common solution is to combine both the mean and the median:
 Create k⋅ℓ hash functions and split them into k distinct groups (each of size ℓ).
 Within each group, use the mean to aggregate the ℓ results.
 Finally, take the median of the k group estimates as the final estimate.
107
Space Requirement
 As we read the stream it is not necessary to store the
elements seen.
 The only thing we need to keep in main memory is one
integer per hash function; this integer records the largest tail
length seen so far for that hash function and any stream
element.
 If we are processing only one stream, we could use millions
of hash functions, which is far more than we need to get a
close estimate.
 Only if we are trying to process many streams at the same
time would main memory constrain the number of hash
functions we could associate with any one stream.
 In practice, the time it takes to compute hash values for each stream element would be the more significant limitation on the number of hash functions we use.
108
Applications
 Web sites often gather statistics on how many unique
users it has seen in each given month. The universal set
is the set of logins for that site, and a stream element is
generated each time a user logs in.
 Amazon: user logs in with their unique login name.
 Google: identifies users by IP addresses.
 Radio-frequency identification (RFID) technology uses
RFID tags and RFID readers (or simply called tags and
readers) to monitor objects in physical world.
 Many events (e.g., TedEx) distribute RFID wrist bands to their
visitors. RFID counting helps reveal the number of people
around.

109
Applications
 DNA Motifs: Sequence motifs are short,
recurring patterns in DNA that are presumed
to have a biological function.
 Number of distinct motifs indicate valuable
biological information about the specific DNA
sequence.
 Denial of service attacks signaled by large
numbers of requests from spoofed IPs.
 Counting distinct elements provide valuable
statistics in these cases.

110
Duplicate-Insensitive Counting
 Distinct-values estimation can also be used as a general tool for duplicate-insensitive counting: each item to be counted views its unique id as its "value", so that the number of distinct values equals the number of items to be counted.
 Duplicate-insensitive counting is useful in mobile computing to avoid double counting nodes that are in motion.
 It can also be used to compute the number of distinct
neighborhoods at a given hop-count from a node and the size of
the transitive closure of a graph.
 In a sensor network, duplicate insensitive counting together
with multi-path in-network aggregation enables robust and
energy-efficient answers to count queries
 Moreover, duplicate insensitive counting is a building block for
duplicate-insensitive computation of other aggregates, such as
sum and average.
111
Some Results
 Wikipedia article on "United States Constitution" had 3978
unique words. When run ten times, Flajolet-Martin
algorithm reported values of 4902, 4202, 4202, 4044, 4367,
3602, 4367, 4202, 4202 and 3891 for an average of 4198. As
can be seen, the average is about right, but the deviation is
between -400 to 1000.
 The algorithm was run on the text dump of The Jungle Book
by Rudyard Kipling. The text was converted into a stream of
tokens and it was found that the total number of unique
tokens was 7150. The approximation of the same using the
Flajolet-Martin algorithm came out to be 7606 which in fact
is pretty close to the actual number.

112
Extra Examples
 Stream: 4, 2, 5 ,9, 1, 6, 3, 7
Hash function, h(x) = (ax + b) mod 32
a) h(x) = 3x + 7 mod 32 b) h(x) = x + 6 mod 32
 a) h(x) = 3x + 7 mod 32
h(4) = 3(4) + 7 mod 32 = 19 mod 32 = 19 = (10011)
h(2) = 3(2) + 7 mod 32 = 13 mod 32 = 13 = (01101)
h(5) = 3(5) + 7 mod 32 = 22 mod 32 = 22 = (10110)
h(9) = 3(9) + 7 mod 32 = 34 mod 32 = 2 = (00010)
h(1) = 3(1) + 7 mod 32 = 10 mod 32 = 10 = (01010)
h(6) = 3(6) + 7 mod 32 = 25 mod 32 = 25 = (11001)
h(3) = 3(3) + 7 mod 32 = 16 mod 32 = 16 = (10000)
h(7) = 3(7) + 7 mod 32 = 28 mod 32 = 28 = (11100)
Trailing zeros: {0, 0, 1, 1, 1, 0, 4, 2}
R = max[trailing zeros] = 4 → Output = 2^R = 2^4 = 16

113
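The hand computation above is easy to verify mechanically; a short Python check (the idiom (n & -n).bit_length() - 1 counts trailing zero bits, with 5 used as the convention for a zero hash value since the range is mod 32):

    stream = [4, 2, 5, 9, 1, 6, 3, 7]
    r = lambda n: (n & -n).bit_length() - 1 if n else 5
    R = max(r((3 * x + 7) % 32) for x in stream)
    print(2 ** R)   # prints 16, matching the hand computation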
Queries over a (Long) Sliding Window

Sliding Windows
 A useful model of stream processing is that queries are about a window of length N – the N most recent elements received
 Interesting case: N is so large that the data cannot be stored in memory, or even on disk
 Or, there are so many streams that windows for all of them cannot be stored
 Amazon example:
 For every product X we keep a 0/1 stream recording whether that product was sold in the n-th transaction
 We want to answer queries such as: how many times have we sold X in the last k sales?
115
Examples
 Example: For each spam mail seen we
emitted a 1. Now, we want to always know
how many of the last million emails were
spam
 Example: For each tweet seen we emitted a 1 if it was a positive remark. Now, we want to always know how many of the last billion tweets expressed positive sentiment

116
Sliding Window: 1 Stream
 Sliding window on a single stream, N = 6:

[Figure: the sequence qwertyuiopasdfghjklzxcvbnm shown four times, with a window of 6 letters sliding one position to the right each time; past to the left, future to the right.]
117
Counting Bits (1)
 Problem:
 Given a stream of 0s and 1s
 Be prepared to answer queries of the form: How many 1s are in the last k bits? where k ≤ N
 Obvious solution: store the most recent N bits
 When a new bit comes in, discard the (N+1)st bit

010011011101010110110110 (suppose N = 6; past to the left, future to the right)
118
Counting Bits (2)
 You cannot get an exact answer without storing the entire window
 Real problem: What if we cannot afford to store N bits?
 E.g., we are processing 1 billion streams and N = 1 billion
 But, as usual, we are happy with an approximate answer
119
An Attempt: Simple Solution
 Q: How many 1s are in the last N bits?
 A simple solution that does not really solve our problem: the uniformity assumption

001110001010010001011011011100101011001101 (window of length N; past to the left, future to the right)

 Maintain 2 counters:
 S: number of 1s from the beginning of the stream
 Z: number of 0s from the beginning of the stream
 How many 1s are in the last N bits? Under the uniformity assumption, approximately N · S / (S + Z)
 But what if the stream is non-uniform?
 What if the distribution changes over time?
120
DGIM Method
 DGIM (Datar-Gionis-Indyk-Motwani): a solution that does not assume uniformity
 We store O(log² N) bits per stream
 The solution gives an approximate answer, never off by more than 50%
 The error factor can be reduced to any fraction ε > 0, with a more complicated algorithm and proportionally more stored bits

121
DGIM - Overview
 Problem Statement:
 Given a stream of bits (0's and 1's), maintain an
approximation of the count of 1's in the last N bits
of the stream, using sub-linear space with respect
to N.
 Overview:
 The algorithm uses a bucket-based approach
where we group consecutive 1's into "buckets" of
different sizes.
 These buckets are managed to ensure space
efficiency while providing an approximate count of
1's in the sliding window of the last N bits.
122
Key Concepts of the DGIM Algorithm:

 Buckets:
 Instead of storing the entire binary stream, the algorithm
groups consecutive 1's into "buckets."
 Each bucket is represented by two key pieces of information:
 Size: The number of 1's it contains.
 Timestamp: The position of the most recent 1 in the bucket (i.e.,
the time at which the last 1 in the bucket appeared in the stream).
 Exponential Bucket Sizes:
 The buckets are organized in a way that their sizes are
powers of two (e.g., 1, 2, 4, 8, ...).
 There can only be at most two buckets of each size at any
time. This restriction allows the algorithm to maintain a
logarithmic number of buckets.
123
Key Concepts of the DGIM Algorithm:

 Sliding Window:
 The algorithm operates on a sliding window of size
N, which represents the last N bits of the stream.
 As new bits arrive, older bits fall out of the window.
The algorithm must update the buckets to reflect
the current window's contents.
 Approximation:
 The algorithm provides an approximate count of 1's
in the last N bits. The count may be off by at most
50% because of the bucket approximation. However,
this tradeoff allows for significant space savings.
124
DGIM method
 Idea: Instead of summarizing fixed-length
blocks, summarize blocks with specific
number of 1s:
 Let the block sizes (number of 1s) increase
exponentially

 When there are few 1s in the window, block sizes stay small, so errors are small
0101011000101101010101010101101010101010111010101011101010001011
N

125
DGIM: Timestamps
 Each bit in the stream has a timestamp, starting 1, 2, …
 Record timestamps modulo N (the window size), so we can represent any relevant timestamp in O(log N) bits

126
DGIM: Buckets
 A bucket in the DGIM method is a record
consisting of:
 (A) The timestamp of its end [O(log N) bits]
 (B) The number of 1s between its beginning and
end [O(log log N) bits]

 Constraint on buckets:
Number of 1s must be a power of 2
 That explains the O(log log N) in (B) above
0101011000101101010101010101101010101010111010101011101010001011
N
127
Reasoning
 To represent a bucket, we need log2 N bits to
represent the timestamp (modulo N) of its right
end.
 To represent the number of 1’s we only need
log2 log2 N bits.
 The reason is that we know this “number i” is a
power of 2, say 2j, so we can represent “i” by
representing (encoding) as “j” in binary.
 Since j is at most log2 N, it requires log2 log2 N
bits.
 Thus, O(log N) bits suffice to represent a bucket.
128
Representing a Stream by Buckets
 Either one or two buckets with the same power-of-2 number of 1s
 Buckets do not overlap in timestamps
 Buckets are sorted by size
 Earlier buckets are not smaller than later buckets
 Buckets disappear when their end-time is > N time units in the past
129
In a nutshell

130
Example: Bucketized Stream

1010110001011010101010101011010101010101110101010111010100010110

Buckets, oldest to newest: at least one bucket of size 16 (partially beyond the window), two of size 8, two of size 4, one of size 2, two of size 1.

Three properties of buckets that are maintained:
 Either one or two buckets with the same power-of-2 number of 1s
 Buckets do not overlap in timestamps
 Buckets are sorted by size
131
Updating Buckets (1)
 When a new bit comes in, drop the last
(oldest) bucket if its end-time is prior to N
time units before the current time
 2 cases: Current bit is 0 or 1

 If the current bit is 0:


no other changes are needed

132
Updating Buckets (2)
 If the current bit is 1:
 (1) Create a new bucket of size 1, for just this bit
 End timestamp = current time
 (2) If there are now three buckets of size 1,
combine the oldest two into a bucket of size 2
 (3) If there are now three buckets of size 2,
combine the oldest two into a bucket of size 4
 (4) And so on …

133
Example: Updating Buckets
Current state of the stream:
1010110001011010101010101011010101010101110101010111010100010110

A bit of value 1 arrives:
0101100010110101010101010110101010101011101010101110101000101100

The two oldest size-1 buckets get merged into a size-2 bucket:
1011000101101010101010101101010101010111010101011101010001011001

The next 1 arrives and a new size-1 bucket is created; then a 0 comes, then a 1:
1100010110101010101010110101010101011101010101110101000101100101

Buckets get merged again:
1100010110101010101010110101010101011101010101110101000101100101

State of the buckets after merging:
11000101101010101010101101010101010111010101011101010001011001011
134
How to Query?
 To estimate the number of 1s in the most
recent N bits:
1. Sum the sizes of all buckets but the last
(note “size” means the number of 1s in the
bucket)
2. Add half the size of the last bucket

 Remember: We do not know how many 1s


of the last bucket are still within the wanted
window

135
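Putting the update and query rules together, here is a compact Python sketch of a DGIM counter (a plain list of buckets and full timestamps are used for readability; a space-optimal version would store timestamps modulo N, and all names are illustrative):

    class DGIM:
        """Approximate count of 1s in the last N bits of a 0/1 stream."""

        def __init__(self, N):
            self.N = N
            self.t = 0
            self.buckets = []   # newest first: (end_timestamp, size)

        def add(self, bit):
            self.t += 1
            # Drop the oldest bucket once its end-time leaves the window.
            if self.buckets and self.buckets[-1][0] <= self.t - self.N:
                self.buckets.pop()
            if bit != 1:
                return          # a 0 needs no further changes
            self.buckets.insert(0, (self.t, 1))
            # If three buckets share a size, merge the two oldest of
            # them into one of double the size, cascading upward.
            i = 0
            while i + 2 < len(self.buckets):
                if self.buckets[i][1] == self.buckets[i + 2][1]:
                    merged = (self.buckets[i + 1][0],      # newer end-time
                              2 * self.buckets[i + 1][1])  # doubled size
                    self.buckets[i + 1:i + 3] = [merged]
                i += 1

        def estimate(self):
            """Sum all bucket sizes but the oldest, plus half of it."""
            if not self.buckets:
                return 0
            return (sum(s for _, s in self.buckets[:-1])
                    + self.buckets[-1][1] // 2)

On the worked example that follows (buckets (41,8) (49,4) (55,4) (59,2) (61,1) (62,1), window of the last 10 bits), the same rule gives 1 + 1 + 2 + 4/2 = 6, matching the slide.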
Example: Bucketized
Stream

At least 1 of 2 of 2 of 1 of 2 of
size 16. Partially size 8 size 4 size 2 size 1
beyond window.

1010110001011010101010101011010101010101110101010111010100010110

136
137
138
Counting 1’s in a
Window
* * 1 0 1 1 0 1 1 0 0 0 1 0 1 1 1 0 1 1 0 0 1 0 1 1 0
Timestamps 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63

Bucket (41,8) (49,4) (55,4) (59,2) (61,1)(62,1)

 Window divided into buckets represented by k = 10 bits


 Timestamp of right end
 Size : Number of 1’s in a bucket (pow. of 2)
How many 1’s in last 10 (k) bits?
 Estimating 1’s in last k=10 bits
 Exact Count = 5
 Estimate = Sum size of all but last and add ½ of the last bucket
 = (1) + (1) + (2) + (42) = 6

139
Error Bound: Proof
 Why is the error at most 50%? Let's prove it!
 Suppose the last bucket has size 2^r
 Then by assuming 2^(r-1) (i.e., half) of its 1s are still within the window, we make an error of at most 2^(r-1)
 Since there is at least one bucket of each of the sizes less than 2^r, the true sum is at least 1 + 2 + 4 + … + 2^(r-1) = 2^r - 1
 Thus, the error is at most 50%

[Figure: a window over the stream in which the oldest bucket, containing at least 16 1s, partially extends beyond the window boundary.]
140
Further Reducing the Error
 Instead of maintaining 1 or 2 of each size
bucket, we allow either r-1 or r buckets (r > 2)
 Except for the largest size buckets; we can have
any number between 1 and r of those
 Error is at most O(1/r)
 By picking r appropriately, we can tradeoff
between number of bits we store and the
error

141
Applications
 Network Monitoring: Estimate suspicious packets (e.g., DDoS
attacks) in recent traffic.
 Web Analytics: Count clicks or active users in a recent window.
 Log Analysis: Monitor error/warning logs in real-time systems.
 Sensor Networks: Estimate recent sensor events (e.g., motion
detection).
 Financial Trading: Track recent buy/sell signals for trading
decisions.
 Social Media: Monitor likes/shares on posts in real-time.
 Spam Detection: Count spam messages in recent emails.
 Smart Grids: Detect power outages in a sliding window for grid
management.
 Video Surveillance: Track motion detection events in security
systems.
 Fraud Detection: Estimate suspicious transactions in real-time
fraud systems.
142
Extensions
 Can we use the same trick to answer queries of the form: How many 1's are in the last k bits? where k < N?
 A: Find the earliest bucket B that overlaps with the last k bits. The number of 1s is the sum of the sizes of the more recent buckets + ½ the size of B

1010110001011010101010101011010101010101110101010111010100010110
(the last k bits are the suffix of the stream)

 Can we handle the case where the stream is not bits, but integers, and we want the sum of the last k elements?
143
Counting positive integers

144
Summary
 Stream Computation and Model
 Characteristics of Stream mining
Algorithms
 Sampling
 Filtering – Bloom Filter
 Counting Distinct Elements - FM
Algorithm
 Counting the number of 1s in the last N
elements - DGIM Algorithm
145
