
Unit-4: Mining Data Stream

Introduction to Data Stream Mining


What is Data Stream Mining?
 It is the activity of extracting insights from continuous, high-speed data records that arrive at the system as a stream.

Figure 1 - Data Stream Mining

Characteristics of Data Stream Mining


1. Continuous Stream of Data:
o A huge amount of data arrives as a potentially infinite stream.
o We never see the entire database at once.
2. Concept Drifting
o The data keeps growing and changing over time.
3. Volatility of Data
o The system does not store all the data it receives, because resources are limited. Once data is analyzed, it is discarded or summarized.

Stream Data & Processing


Stream Data
 Streaming data is becoming a core component of enterprise data architecture due to the
explosive growth of data from non-traditional sources such as IoT sensors, security logs and
web applications.
 Streaming data refers to data that is continuously generated, usually in high volumes and
at high velocity.
 A streaming data source would typically consist of a stream of logs that record events as
they happen – such as a user clicking on a link in a web page, or a sensor reporting the
current temperature.
 Examples of streaming data include:

Figure 2 - Examples of Stream Data

Stream Processing
 Stream processing used to be a ‘niche’ technology used only by a small subset of companies.
 However, with the rapid growth of SaaS, IoT and machine learning, organizations across industries are now dipping their toes into streaming analytics.
 It’s difficult to find a modern company that doesn’t have an app or a website; as traffic to these digital assets grows, and with increasing demand for complex, real-time analytics, the need to adopt modern data infrastructure is quickly becoming mainstream.
 While traditional batch architectures can be sufficient at smaller scales, stream processing provides several benefits that other data platforms cannot.

Benefits of Stream Processing

1. Able to deal with never-ending streams of events


 Some kinds of data are naturally structured as a never-ending stream of events.
 Traditional batch processing tools require stopping the stream of events, capturing batches of data, and combining the batches to draw overall conclusions.
 Stream processing, in contrast, handles never-ending streams natively and provides immediate insights from large volumes of streaming data.

2. Real-time or Near-Real-time Processing


 Most organizations adopt stream processing to enable real-time data analytics.
 Although real-time analytics is also possible with high-performance database systems, the data is often more naturally handled in a stream processing model.

3. Detecting Patterns in time-series data


 Identifying patterns over time, for example trends in website traffic data, requires data to be continuously processed and analyzed.
 Batch processing makes this more difficult because it breaks data into batches, meaning some events are split across two or more batches.

4. Easy Data Scalability


 Increasing data volumes can break a batch processing system, requiring additional resources or a modified architecture.
 Modern stream processing infrastructure is hyper-scalable: a single stream processor can deal with gigabytes of data per second, so it copes with growing data volumes without infrastructure changes.

Stream Data Model and Architecture
 A streaming data architecture is a framework of software components built to ingest and process large amounts of streaming data from numerous sources.
 While traditional data solutions focus on writing and reading data in batches, a streaming data architecture ingests data immediately as it is produced, persists it to storage, and may include various additional components per use case - such as tools for real-time processing, data manipulation and analytics.
 Streaming architectures must account for the unique characteristics of data streams, which tend to generate huge amounts of data (terabytes to petabytes) that is at best semi-structured and requires significant pre-processing.

Streaming Architecture Components

The Message Broker / Stream Processor


 This is the element that takes data from a source, called a producer, translates it into a
standard message format, and streams it on an ongoing basis.
 Other components can then listen in and consume the messages passed on by the broker.
 The first generation of message brokers, such as RabbitMQ and Apache ActiveMQ, relied on the Message-Oriented Middleware (MOM) paradigm.
 Two popular stream processing tools are Apache Kafka and Amazon Kinesis Data Streams; a minimal producer/consumer sketch follows the figure below.

Figure 3 - Message Broker / Stream Processor
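To make the broker pattern concrete, here is a minimal producer/consumer sketch in Python using the kafka-python client. This is an illustration only: the choice of client library, the broker address localhost:9092, and the topic name "events" are all assumptions, not part of the text above.

```python
# Minimal Kafka producer/consumer sketch (assumes `pip install kafka-python`
# and a broker running at localhost:9092; the topic "events" is hypothetical).
import json
from kafka import KafkaProducer, KafkaConsumer

# Producer: take data from a source, translate it into a standard
# message format (JSON here), and stream it to the broker.
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)
producer.send("events", {"user": "u42", "action": "click", "page": "/home"})
producer.flush()

# Consumer: other components listen in and consume the messages
# passed on by the broker.
consumer = KafkaConsumer(
    "events",
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda b: json.loads(b.decode("utf-8")),
    auto_offset_reset="earliest",
    consumer_timeout_ms=5000,  # stop iterating after 5s of inactivity
)
for message in consumer:
    print(message.value)
```

Kinesis follows the same produce/consume pattern, just with a different client API.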

Batch and Real-time ETL tools


 Data streams from one or more message brokers need to be aggregated, transformed and restructured before the data can be analyzed with SQL-based analytics tools.
 This is typically done by an ETL tool or platform that receives queries from users, fetches events from message queues and applies the query to generate a result, often performing additional joins, transformations or aggregations on the data.

 The result may be an API call, an action, a visualization, an alert, or in some cases a new data
stream.

Figure 4 - Batch and Real-time ETL tools

Data Analytics / Serverless Query Engine


 After streaming data is prepared for consumption by the stream processor, it must be analyzed or visualized to provide value.
 There are numerous approaches to streaming data analytics.

Streaming Data Storage


 As every organization has massive amounts of data to store, they opt for cheap storage options and store the data as it streams in.

Stream Computing
 A high-performance computer system that analyzes multiple data streams from many
sources live.
 The word stream in stream computing is used to mean pulling in streams of data, processing
the data and streaming it back out as a single flow.

Figure 5 - Stream Computing

 Stream computing uses software algorithms that analyze the data in real time as it streams in, increasing speed and accuracy when dealing with data handling and analysis.

Sampling Data in Stream
 A sample of a data stream is much smaller than the whole stream.
 It can be designed to retain many relevant features of the stream, and it can be used to compute many relevant aggregates over the stream.
 Unlike sampling from a stored data set, stream sampling must be performed online, as the data arrives. Any element that is not stored in the sample is lost forever and cannot be recovered. A classic technique for this is reservoir sampling, sketched below.
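A minimal sketch of reservoir sampling (the function and parameter names are illustrative): after n elements have arrived, every element seen so far is in the sample with equal probability s/n.

```python
import random

def reservoir_sample(stream, s):
    """Maintain a uniform random sample of s elements from a stream.

    Sampling happens online, as each element arrives; any element
    not kept in the reservoir is lost and cannot be recovered.
    """
    reservoir = []
    for n, item in enumerate(stream, start=1):
        if n <= s:
            reservoir.append(item)      # keep the first s elements
        else:
            j = random.randrange(n)     # uniform index in [0, n)
            if j < s:                   # happens with probability s/n
                reservoir[j] = item     # evict a random old element
    return reservoir

# Example: keep a 5-element sample of a 1000-element stream.
print(reservoir_sample(range(1000), s=5))
```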

Filtering Stream
 Due to the nature of data streams, stream filtering is one of the most useful and practical approaches to efficient stream evaluation.
 Two filtering approaches are used in Data Stream Management Systems (DSMS):
1. Implicit Filtering
2. Explicit Filtering

Implicit Filtering
 Data stream management systems cope with the high rates and the bursty nature of streams in several ways in order to guarantee stability under heavy loads.
 Some of them employ various load shedding techniques, which reduce the load by processing only a fraction of the items from the stream and discarding others without any processing.
 The Aurora DSMS employs random and semantic load shedding techniques to deal with the unpredictable nature of data streams, where semantic load shedding makes use of tuple utility computed from quality-of-service (QoS) parameters.
 The system automatically drops tuples that are assumed to be less important for stream evaluation.
 QoS of the system is captured by several functions:
 Latency graph, which specifies the utility of a tuple as a function of its propagation through the query plan.
 Value-based graph, which specifies which values of the output are more important than others.
 Loss-tolerance graph, which describes how sensitive the application is to approximate answers.
 A strategy of dropping tuples at the early stages of the query plan makes the process of
query evaluation more efficient for subsequent operators in the plan.

Explicit Filtering
 Implicit filtering techniques may have negative impacts on a variety of data stream analysis problems, such as computation of sketches, sampling of distinct items, and other properties of the stream.
 Problems in this category also include estimation of IP network flow sizes and detection of worm signatures in the network.
 Fine-grained estimation of network traffic (flow) volume is very important in various network analysis tasks.
 Explicit filtering offers a threshold sampling algorithm that generates a sample of stream items with guarantees on the estimated flow sizes.
 The sample generation procedure maintains a threshold value, which is compared to the size of each item to decide whether the item is filtered out or retained in the sample.
 The behavior of this sampling algorithm is governed by several parameters, such as the number of items in the final sample, the item sizes, the threshold value, and the count of items larger than the threshold. A sketch of the sampling procedure follows.
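A minimal Python sketch of the threshold sampling idea. This assumes the variant in which an item of size x is retained with probability min(1, x/z) for threshold z, and a retained item's size is estimated as max(x, z) so the estimate is unbiased; the flow IDs, sizes, and threshold below are made-up illustration values.

```python
import random

def threshold_sample(flows, z):
    """Sample (flow_id, size) items: an item of size x is retained with
    probability min(1, x / z); retained items carry the unbiased size
    estimate max(x, z)."""
    sample = []
    for flow_id, size in flows:
        if random.random() < min(1.0, size / z):
            sample.append((flow_id, max(size, z)))
    return sample

# Flows at least as large as z are always kept; smaller flows survive
# with probability proportional to their size.
flows = [("f1", 5), ("f2", 250), ("f3", 90), ("f4", 1000)]
print(threshold_sample(flows, z=100))
```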

Counting Distinct Elements in Stream


 Let a = a1,... ,an be a sequence of n elements from the domain [m] = {1, . . . , m}.
 The zeroth-frequency moment of this sequence is the number of distinct elements that
occur in the sequence and is denoted F0 = F0(a).
 In the data stream model, an algorithm is considered efficient if it makes one (or a small
number of) passes over the input sequence, uses very little space, and processes each
element of the input very quickly.
 In our context, a data stream algorithm to approximate F0 is considered efficient if it uses
only poly(1/ε, log n, log m) bits of memory, where 1 ± ε is the factor within which F0 must
be approximated.
 Let ε, δ > 0 be given.
 An algorithm A is said to (ε, δ)-approximate F0 if, for any sequence a = a_1, ..., a_n with each a_i ∈ [m], it outputs a number F̃0 such that $\Pr[\,|\tilde{F}_0 - F_0| \le \varepsilon F_0\,] \ge 1 - \delta$, where the probability is taken over the internal coin tosses of A.
 Two main parameters of A are of interest: the workspace and the time to process each item. We study these quantities as functions of the domain size m, the number n of elements in the stream, the approximation parameter ε, and the confidence parameter δ.
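One classic algorithm meeting these requirements (not named above, so this is an assumption about which method is intended) is the Flajolet-Martin sketch: hash each element, remember the maximum number R of trailing zero bits seen in any hash value, and estimate F0 as 2^R. A minimal single-hash sketch; practical versions combine many hash functions to achieve the (ε, δ) guarantee.

```python
import hashlib

def trailing_zeros(n):
    """Number of trailing zero bits of n (treated as 0 when n == 0)."""
    if n == 0:
        return 0
    count = 0
    while n & 1 == 0:
        n >>= 1
        count += 1
    return count

def flajolet_martin(stream):
    """One-pass F0 estimate using a tiny amount of state: only the
    maximum number of trailing zeros R seen so far is stored."""
    R = 0
    for element in stream:
        h = int(hashlib.md5(str(element).encode()).hexdigest(), 16)
        R = max(R, trailing_zeros(h))
    return 2 ** R  # estimate of the number of distinct elements

# Stream with F0 = 6 distinct elements; prints 2^R for one hash function.
print(flajolet_martin(["a", "b", "c", "a", "b", "d", "e", "f", "a"]))
```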

Estimating Moments
 Alon-Matias-Szegedy (AMS) Algorithm
o Example stream: a, b, c, b, d, a, c, d, a, b, d, c, a, a, b
o n = 15; counts: a (x5), b (x4), c (x3), d (x3); 2nd moment F2 = 25 + 16 + 9 + 9 = 59
o Each variable X yields the estimate n * (2 * X.value - 1)
o Pick random positions X1 (3rd), X2 (8th), X3 (13th)
o X1.element = "c", X1.value = 3 (# of "c" from the 3rd position onward)
o Estimate for X1 = 15 * (2*3 - 1) = 75
o Estimate for X2 = 15 * (2*2 - 1) = 45 (# of "d" from the 8th position onward = 2)
o Estimate for X3 = 45 (# of "a" from the 13th position onward = 2)
o Average of X1, X2, X3 = 165/3 = 55 (close to the true value 59)
 In the case of infinite streams:
o We store one variable per randomly chosen position, so the challenge is not choosing n.
o The challenge is selecting the positions when the stream length is unknown in advance.
 Strategy for position selection, assuming we have space to store s variables and have seen n elements (a code sketch follows this list):
o The first s positions are chosen.
o When the (n+1)th token arrives, it is selected with probability s/(n+1).
o If it is selected, one of the existing s variables is discarded uniformly at random and a new variable is created for the arriving element, with value 1.
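A minimal Python sketch combining the AMS estimate with the reservoir-style position selection just described (the function name and the choice of s are illustrative; the output is randomized).

```python
import random

def ams_second_moment(stream, s):
    """One-pass AMS estimate of the second moment F2 using s variables.

    Each variable remembers an element and counts its occurrences from
    the variable's start position onward; positions are chosen with
    reservoir sampling since the stream length is unknown in advance.
    """
    variables = []  # list of [element, count] pairs
    n = 0
    for item in stream:
        n += 1
        for var in variables:           # item occurs again: bump counts
            if var[0] == item:
                var[1] += 1
        if len(variables) < s:          # first s positions are chosen
            variables.append([item, 1])
        elif random.randrange(n) < s:   # select position with prob. s/n
            variables[random.randrange(s)] = [item, 1]  # evict at random
    # Each variable contributes the estimate n * (2 * count - 1).
    return sum(n * (2 * v[1] - 1) for v in variables) / len(variables)

stream = list("abcbdacdabdcaab")        # the 15-element example stream
print(ams_second_moment(stream, s=3))   # randomized; the true F2 is 59
```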

Counting Ones in a Window


 Let’s suppose a window of length N on a binary stream. We want at all times to be able to answer queries of the form “how many 1’s are there in the last k bits?” for any k ≤ N. For this purpose we use the DGIM algorithm.
 The basic version of the algorithm uses O(log² N) bits to represent a window of N bits, and allows us to estimate the number of 1’s in the window with an error of no more than 50%.
 To begin, each bit of the stream has a timestamp, the position in which it arrives. The first
bit has timestamp 1, the second has timestamp 2, and so on.
 Since we only need to distinguish positions within the window of length N, we shall
represent timestamps modulo N, so they can be represented by log2 N bits.
 If we also store the total number of bits ever seen in the stream (i.e., the most recent
timestamp) modulo N, then we can determine from a timestamp modulo N where in the
current window the bit with that timestamp is.
 We divide the window into buckets, each consisting of:
o The timestamp of its right (most recent) end.
o The number of 1’s in the bucket. This number must be a power of 2, and we refer to the number of 1’s as the size of the bucket.
 To represent a bucket, we need log2 N bits to represent the timestamp (modulo N) of its
right end.
 To represent the number of 1’s we only need log2 log2 N bits. The reason is that we know this number i is a power of 2, say 2^j, so we can represent i by coding j in binary.
 Since j is at most log2 N, it requires log2 log2 N bits. Thus, O(log N) bits suffice to represent a bucket.

Figure 6 - A bit stream divided into buckets following the DGIM rules

 There are six rules that must be followed when representing a stream by buckets.
1. The right end of a bucket is always a position with a 1.
2. Every position with a 1 is in some bucket.
3. No position is in more than one bucket.
4. There are one or two buckets of any given size, up to some maximum size.
5. All sizes must be a power of 2.
6. Buckets cannot decrease in size as we move to the left (back in time).
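A simplified Python sketch of DGIM bucket maintenance and querying under these rules. For readability it stores absolute timestamps instead of timestamps modulo N; bucket sizes are powers of 2 and at most two buckets share a size.

```python
def dgim_update(buckets, t, bit, N):
    """Process one bit at time t; `buckets` holds (timestamp, size)
    pairs, newest first, for a window of the last N bits."""
    # Drop the oldest bucket once its right end leaves the window.
    if buckets and buckets[-1][0] <= t - N:
        buckets.pop()
    if bit == 1:
        buckets.insert(0, (t, 1))
        size = 1
        # Whenever three buckets share a size, merge the two oldest of
        # them into one bucket of twice the size (restoring rule 4).
        while [b[1] for b in buckets].count(size) == 3:
            idxs = [i for i, b in enumerate(buckets) if b[1] == size]
            i, j = idxs[1], idxs[2]             # the two oldest
            merged = (buckets[i][0], size * 2)  # keep newer right end
            del buckets[j]
            buckets[i] = merged
            size *= 2

def dgim_count(buckets, t, k):
    """Estimate the number of 1s in the last k bits: all qualifying
    buckets count fully except the oldest, which counts half."""
    total, oldest = 0, 0
    for ts, size in buckets:
        if ts > t - k:
            total += size
            oldest = size
    return total - oldest // 2

buckets, N = [], 16
stream = [1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 0, 1, 1, 0, 1, 1]
for t, bit in enumerate(stream, start=1):
    dgim_update(buckets, t, bit, N)
print(dgim_count(buckets, len(stream), k=8))  # prints 5; true count is 5
```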

Decaying Window
 This algorithm allows you to identify the most popular elements (trending, in other words)
in an incoming data stream.
 The decaying window algorithm not only tracks the most recurring elements in an incoming
data stream, but also discounts any random spikes or spam requests that might have
boosted an element’s frequency.
 In a decaying window, you assign a score or weight to every element of the incoming data
stream. Further, you need to calculate the aggregate sum for each distinct element by
adding all the weights assigned to that element.
 The element with the highest total score is listed as trending or the most popular.
1. Assign each element a weight/score.
2. Calculate the aggregate sum for each distinct element by adding all the weights assigned to that element.
 In a decaying window algorithm, you assign more weight to newer elements.
 When a new element arrives, you first reduce the weight of all the existing elements by a constant factor (1 - c) and then assign the new element its own weight.
 The aggregate sum of the decaying exponential weights can be calculated using the following formula:

$\sum_{i=0}^{t-1} a_{t-i} (1 - c)^i$

 Here, c is usually a small constant, on the order of 10^-6 or 10^-9.


 Whenever a new element, say a_{t+1}, arrives in the data stream you perform the following steps to obtain an updated sum:
o Multiply the current sum/score by the value (1 - c).
o Add the weight corresponding to the new element.

Figure 7 - Weight decays exponentially over time

 In a data stream consisting of various elements, you maintain a separate sum for each distinct element.
 For every incoming element, you multiply the sums of all the existing elements by (1 - c); then you add the weight of the incoming element to its corresponding aggregate sum.
 A threshold can be maintained, and elements whose weight falls below it can be ignored.
 Finally, the element with the highest aggregate score is listed as the most popular element.

Example
 For example, consider the sequence of Twitter tags below:
o fifa, ipl, fifa, ipl, ipl, ipl, fifa
 Let each element in the sequence have a weight of 1, and let c be 0.1.
 The aggregate sum of each tag at the end of the above stream is calculated as below; at every arrival the running sum is multiplied by (1-0.1) and the weight of the arriving tag (1 or 0) is added:
 fifa:
o fifa - 0 * (1-0.1) + 1 = 1
o ipl - 1 * (1-0.1) + 0 = 0.9 (adding 0 because the current tag is not fifa)
o fifa - 0.9 * (1-0.1) + 1 = 1.81 (adding 1 because the current tag is fifa)
o ipl - 1.81 * (1-0.1) + 0 = 1.629
o ipl - 1.629 * (1-0.1) + 0 = 1.4661
o ipl - 1.4661 * (1-0.1) + 0 = 1.3195
o fifa - 1.3195 * (1-0.1) + 1 = 2.1875
 ipl:
o fifa - 0 * (1-0.1) + 0 = 0
o ipl - 0 * (1-0.1) + 1 = 1
o fifa - 1 * (1-0.1) + 0 = 0.9 (adding 0 because the current tag is not ipl)
o ipl - 0.9 * (1-0.1) + 1 = 1.81
o ipl - 1.81 * (1-0.1) + 1 = 2.629
o ipl - 2.629 * (1-0.1) + 1 = 3.3661
o fifa - 3.3661 * (1-0.1) + 0 = 3.0295
 At the end of the sequence, the score of fifa is 2.1875 while the score of ipl is 3.0295.
 So, ipl is more trending than fifa.
 ipl wins both because it occurs more often (four times versus three) and because its occurrences are more recent; recency is exactly what the decaying window rewards.
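A minimal Python sketch of the decaying window update that reproduces the numbers above (the decay constant c = 0.1 and unit weights follow the example; in practice you would also drop sums that fall below a threshold).

```python
def decaying_scores(stream, c=0.1):
    """Maintain one exponentially decayed score per distinct element:
    on each arrival every sum is multiplied by (1 - c), then the
    arriving element's sum is incremented by its weight (1 here)."""
    scores = {}
    for tag in stream:
        for key in scores:                     # decay all existing sums
            scores[key] *= (1 - c)
        scores[tag] = scores.get(tag, 0) + 1   # add the new weight
    return scores

tags = ["fifa", "ipl", "fifa", "ipl", "ipl", "ipl", "fifa"]
print(decaying_scores(tags))
# {'fifa': 2.1875..., 'ipl': 3.0294...} -> ipl is the trending tag
```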

Advantages of Decaying Window Algorithm:
 Sudden spikes or spam data are discounted.
 Newer elements are given more weight, which produces the right trending output.

Real Time Analytics Platform Applications


 Fraud detection system for online transactions
 Log analysis for understanding usage pattern
 Click analysis for online recommendations
 Social media analytics
 Push notifications to customers for location-based advertisements in retail
 Action for emergency services such as fires and accidents in an industrial plant
 Immediate reaction to abnormal measurements in healthcare monitoring

RTAP Applications

1. Apache Samza
 Apache Samza is an open source, near-real time, asynchronous computational framework
for stream processing developed by the Apache Software Foundation in Scala and Java.
 Samza allows users to build stateful applications that process data in real-time from multiple
sources including Apache Kafka.
 Samza provides fault tolerance, isolation and stateful processing. Samza is used by multiple companies; the biggest installation is at LinkedIn.

2. Apache Flink
 Apache Flink is an open-source, unified stream-processing and batch-processing framework
developed by the Apache Software Foundation.
 The core of Apache Flink is a distributed streaming data-flow engine written in Java and
Scala.
 Flink provides a high-throughput, low-latency streaming engine as well as support for event-time processing and state management.
 Flink does not provide its own data-storage system, but provides data-source and sink connectors to systems such as Amazon Kinesis, Apache Kafka, Alluxio, HDFS, Apache Cassandra, and Elasticsearch.

Real Time Sentiment Analysis


 Sentiment analysis is a text analysis technique that uses natural language processing (NLP), machine learning, and other data analysis techniques to read, analyze and derive objective quantitative results from raw text.
 Texts are classified as positive, negative, neutral, or anywhere in between.
 It can read all manner of text for opinion and emotion, helping to understand the thoughts and feelings of the writer.

 Sentiment analysis is also known as opinion mining.

 Real-time sentiment analysis is an AI-powered solution to track mentions of your brand and products, wherever they may appear, and automatically analyze them with almost no human input needed. A toy sketch of the scoring step follows the figure below.

Figure 8 - Sentiment Analysis Process
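As an illustration only (the word lists and mentions below are made up; a production system would use trained NLP models as described above), here is a minimal lexicon-based scorer applied to a stream of brand mentions.

```python
# Tiny hand-made sentiment lexicons; purely illustrative.
POSITIVE = {"love", "great", "excellent", "good", "amazing"}
NEGATIVE = {"hate", "terrible", "bad", "awful", "broken"}

def sentiment(text):
    """Classify one mention as positive, negative, or neutral by
    counting lexicon hits."""
    words = text.lower().split()
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"

# Simulate a real-time stream of mentions and tally the results.
mentions = [
    "I love the new phone and the camera is great",
    "Battery life is terrible and the screen arrived broken",
    "Just bought it today",
]
counts = {"positive": 0, "negative": 0, "neutral": 0}
for mention in mentions:
    counts[sentiment(mention)] += 1
print(counts)  # {'positive': 1, 'negative': 1, 'neutral': 1}
```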

 Benefits of performing real time sentiment analysis:


o Marketing campaign success analysis
o Prevention of business disasters
o Instant product feedback
o Stock market predictions
 Basic components of an opinion
o Opinion Holder
o Object
o Opinion
 Opinion Mining Tasks
o At Documents or review Level
 Task: Sentiment classification of reviews
 Classes: Positive, Negative and Neutral
o At the Sentence Level
 Task-1: Identifying subjective/opinionated sentences
 Classes: Objective and subjective
 Task-2: Sentiment classification of sentences
 Classes: Positive, Negative, and Neutral
o At feature level
 Task-1: Identify and extract object features that have been commented
on by an opinion holder – e.g. reviewer
 Task-2: Determine whether the opinions on the features are positive,
negative or neutral

 Task-3: Group feature synonyms - this produces a feature-based opinion summary of multiple reviews.
 Two types of opinions:
o Regular Opinions
 Sentiment/opinion expression on some target entities, e.g. products,
events, topics, persons.
 Direct Opinion and Indirect Opinion
o Comparative Opinion
 Comparisons of more than one entity.
 Example: a review stating that one phone's camera is better than another's.

Case Study – Stock Market Predictions


 The necessity of Data Mining
o Data
o Information
o Business Decisions
 Principle of the stock market data mining algorithm.
 Graph Analytics and Big Data.
 Graph analytics is an analytics alternative that uses an abstraction called a graph model.
 Graph analytics is an alternative to the traditional data warehouse model as a framework
for absorbing both structured and unstructured data.
 It employs the graph abstraction for representing connectivity, consisting of a collection of vertices (nodes) connected by edges.
 Graph analytics platforms aim to deliver predictable, interactive performance.

