
Unit-4: Mining Data Stream

Introduction to Data Stream Mining


What is Data Stream Mining?
 It is the activity of extracting insights from continuous, high-speed data records that arrive at the system as a stream.

Figure 1 - Data Stream Mining

Characteristics of Data Stream Mining


1. Continuous Stream of Data:
o A huge amount of data arrives as a potentially infinite stream.
o We never see the entire database at once.
2. Concept Drifting
o The data keeps growing and changing over time.
3. Volatility of Data
o The system does not store all the data it receives, because resources are limited. Once data is analyzed, it is discarded or summarized.

Stream Data & Processing


Stream Data
 Streaming data is becoming a core component of enterprise data architecture due to the
explosive growth of data from non-traditional sources such as IoT sensors, security logs and
web applications.
 Streaming data refers to data that is continuously generated, usually in high volumes and
at high velocity.
 A streaming data source would typically consist of a stream of logs that record events as
they happen – such as a user clicking on a link in a web page, or a sensor reporting the
current temperature.
 Examples of streaming data include:

Figure 2 - Examples of Stream Data

Stream Processing
 Stream processing used to be a ‘niche’ technology used only by a small subset of companies.
 However, with the rapid growth of SaaS, IoT and machine learning, organizations across industries are now dipping their toes into streaming analytics.
 It’s difficult to find a modern company that doesn’t have an app or a website; as traffic to these digital assets grows, and with increasing demand for complex, real-time analytics, the need to adopt modern data infrastructure is quickly becoming mainstream.
 While traditional batch architectures can be sufficient at smaller scales, stream processing provides several benefits that other data platforms cannot.

Benefits of Stream Processing

1. Able to deal with never-ending streams of events


 Some kinds of data are naturally structured as a never-ending stream of events.
 Traditional batch processing tools require stopping the stream of events, capturing batches of data, and combining the batches to draw overall conclusions.
 Stream processing, in contrast, handles never-ending streams natively and provides immediate insights from large volumes of streaming data.

2. Real-time or Near-Real-time Processing


 Most organizations adopt stream processing to enable real-time data analytics.
 Although real-time analytics is also possible with high-performance database systems, the data is often more naturally handled in a stream processing model.

3. Detecting Patterns in time-series data


 Identifying patterns over time, for example trends in website traffic data, requires data to be continuously processed and analyzed.
 Batch processing makes this more difficult because it breaks data into batches, meaning some events are split across two or more batches.

4. Easy Data Scalability


 Increasing data volumes can break a batch processing system, requiring additional resources or a modified architecture.
 Modern stream processing infrastructure is hyper-scalable: a single stream processor can deal with gigabytes of data per second, so it copes with growing data volumes without infrastructure changes.

Stream Data Model and Architecture
 A streaming data architecture is a framework of software components built to ingest and process large amounts of streaming data from numerous sources.
 While traditional data solutions focus on writing and reading data in batches, a streaming data architecture ingests data immediately as it is produced, persists it to storage, and may include various additional components per use case - such as tools for real-time processing, data manipulation and analytics.
 Streaming architectures must account for the unique characteristics of data streams, which tend to generate huge amounts of data (terabytes to petabytes) that is at best semi-structured and requires significant pre-processing.

Streaming Architecture Components

The Message Broker / Stream Processor


 This is the element that takes data from a source, called a producer, translates it into a
standard message format, and streams it on an ongoing basis.
 Other components can then listen in and consume the messages passed on by the broker.
 The first generation of message brokers, such as RabbitMQ and Apache ActiveMQ, relied on the Message-Oriented Middleware (MOM) paradigm.
 Two popular stream processing tools are Apache Kafka and Amazon Kinesis Data Streams; a minimal producer/consumer sketch follows the figure below.

Figure 3 - Message Broker / Stream Processor
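To make the broker pattern concrete, here is a minimal producer/consumer sketch in Python using the kafka-python client. This is an illustration only: the choice of client library, the broker address localhost:9092, and the topic name "events" are all assumptions, not part of the text above.

```python
# Minimal Kafka producer/consumer sketch (assumes `pip install kafka-python`
# and a broker running at localhost:9092; the topic "events" is hypothetical).
import json
from kafka import KafkaProducer, KafkaConsumer

# Producer: take data from a source, translate it into a standard
# message format (JSON here), and stream it to the broker.
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)
producer.send("events", {"user": "u42", "action": "click", "page": "/home"})
producer.flush()

# Consumer: other components listen in and consume the messages
# passed on by the broker.
consumer = KafkaConsumer(
    "events",
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda b: json.loads(b.decode("utf-8")),
    auto_offset_reset="earliest",
    consumer_timeout_ms=5000,  # stop iterating after 5s of inactivity
)
for message in consumer:
    print(message.value)
```

Kinesis follows the same produce/consume pattern, just with a different client API.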

Batch and Real-time ETL tools


 Data streams from one or more message brokers need to be aggregated, transformed and restructured before the data can be analyzed with SQL-based analytics tools.
 This is typically done by an ETL tool or platform that receives queries from users, fetches events from message queues and applies the query to generate a result, often performing additional joins, transformations or aggregations on the data.

 The result may be an API call, an action, a visualization, an alert, or in some cases a new data
stream.

Figure 4 - Batch and Real-time ETL tools

Data Analytics / Serverless Query Engine


 After streaming data is prepared for consumption by the stream processor, it must be analyzed or visualized to provide value.
 There are numerous approaches to streaming data analytics.

Streaming Data Storage


 As every organization has massive amounts of data to store, they opt for cheap storage options and store the data as it streams in.

Stream Computing
 A high-performance computer system that analyzes multiple data streams from many
sources live.
 The word stream in stream computing is used to mean pulling in streams of data, processing
the data and streaming it back out as a single flow.

Figure 5 - Stream Computing

 Stream computing uses software algorithms that analyze the data in real time as it streams in, increasing speed and accuracy when dealing with data handling and analysis.

Sampling Data in Stream
 A sample of a data stream is much smaller than the whole stream.
 It can be designed to retain many relevant features of the stream, and it can be used to compute many relevant aggregates over the stream.
 Unlike sampling from a stored data set, stream sampling must be performed online, as the data arrives. Any element that is not stored in the sample is lost forever and cannot be recovered. A classic technique for this is reservoir sampling, sketched below.
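A minimal sketch of reservoir sampling (the function and parameter names are illustrative): after n elements have arrived, every element seen so far is in the sample with equal probability s/n.

```python
import random

def reservoir_sample(stream, s):
    """Maintain a uniform random sample of s elements from a stream.

    Sampling happens online, as each element arrives; any element
    not kept in the reservoir is lost and cannot be recovered.
    """
    reservoir = []
    for n, item in enumerate(stream, start=1):
        if n <= s:
            reservoir.append(item)      # keep the first s elements
        else:
            j = random.randrange(n)     # uniform index in [0, n)
            if j < s:                   # happens with probability s/n
                reservoir[j] = item     # evict a random old element
    return reservoir

# Example: keep a 5-element sample of a 1000-element stream.
print(reservoir_sample(range(1000), s=5))
```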

Filtering Stream
 Due to the nature of data streams, stream filtering is one of the most useful and practical approaches to efficient stream evaluation.
 Two filtering approaches are used in Data Stream Management Systems (DSMS):
1. Implicit Filtering
2. Explicit Filtering

Implicit Filtering
 Data stream management systems cope with the high rates and the bursty nature of streams in several ways in order to guarantee stability under heavy loads.
 Some of them employ various load shedding techniques, which reduce the load by processing only a fraction of the items from the stream and discarding others without any processing.
 The Aurora DSMS employs random and semantic load shedding techniques to deal with the unpredictable nature of data streams, where semantic load shedding makes use of tuple utility computed from quality-of-service (QoS) parameters.
 The system automatically drops tuples that are assumed to be less important for stream evaluation.
 QoS of the system is captured by several functions:
 Latency graph, which specifies the utility of a tuple as a function of its propagation through the query plan.
 Value-based graph, which specifies which values of the output are more important than others.
 Loss-tolerance graph, which describes how sensitive the application is to approximate answers.
 A strategy of dropping tuples at the early stages of the query plan makes the process of
query evaluation more efficient for subsequent operators in the plan.

Explicit Filtering
 Implicit filtering techniques may have negative impacts on a variety of data stream analysis problems, such as computation of sketches, sampling of distinct items, and other properties of the stream.
 Problems in this category also include estimation of IP network flow sizes and detection of worm signatures in the network.
 Fine-grained estimation of network traffic (flow) volume is very important in various network analysis tasks.
 Explicit filtering offers a threshold sampling algorithm that generates a sample of stream items with guarantees on the estimated flow sizes.
 The sample generation procedure maintains a threshold value, which is compared to the size of each item to decide whether the item is filtered out or retained in the sample.
 The behavior of this sampling algorithm is governed by several parameters, such as the number of items in the final sample, the item sizes, the threshold value, and the count of items larger than the threshold. A sketch of the sampling procedure follows.
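A minimal Python sketch of the threshold sampling idea. This assumes the variant in which an item of size x is retained with probability min(1, x/z) for threshold z, and a retained item's size is estimated as max(x, z) so the estimate is unbiased; the flow IDs, sizes, and threshold below are made-up illustration values.

```python
import random

def threshold_sample(flows, z):
    """Sample (flow_id, size) items: an item of size x is retained with
    probability min(1, x / z); retained items carry the unbiased size
    estimate max(x, z)."""
    sample = []
    for flow_id, size in flows:
        if random.random() < min(1.0, size / z):
            sample.append((flow_id, max(size, z)))
    return sample

# Flows at least as large as z are always kept; smaller flows survive
# with probability proportional to their size.
flows = [("f1", 5), ("f2", 250), ("f3", 90), ("f4", 1000)]
print(threshold_sample(flows, z=100))
```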

Counting Distinct Elements in Stream


 Let a = a1,... ,an be a sequence of n elements from the domain [m] = {1, . . . , m}.
 The zeroth-frequency moment of this sequence is the number of distinct elements that
occur in the sequence and is denoted F0 = F0(a).
 In the data stream model, an algorithm is considered efficient if it makes one (or a small
number of) passes over the input sequence, uses very little space, and processes each
element of the input very quickly.
 In our context, a data stream algorithm to approximate F0 is considered efficient if it uses
only poly(1/ε, log n, log m) bits of memory, where 1 ± ε is the factor within which F0 must
be approximated.
 Let ε, δ > 0 be given.
 An algorithm A is said to (ε, δ)-approximate F0 if, for any sequence a = a_1, ..., a_n with each a_i ∈ [m], it outputs a number F̃0 such that $\Pr[\,|\tilde{F}_0 - F_0| \le \varepsilon F_0\,] \ge 1 - \delta$, where the probability is taken over the internal coin tosses of A.
 Two main parameters of A are of interest: the workspace and the time to process each item. We study these quantities as functions of the domain size m, the number n of elements in the stream, the approximation parameter ε, and the confidence parameter δ.
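One classic algorithm meeting these requirements (not named above, so this is an assumption about which method is intended) is the Flajolet-Martin sketch: hash each element, remember the maximum number R of trailing zero bits seen in any hash value, and estimate F0 as 2^R. A minimal single-hash sketch; practical versions combine many hash functions to achieve the (ε, δ) guarantee.

```python
import hashlib

def trailing_zeros(n):
    """Number of trailing zero bits of n (treated as 0 when n == 0)."""
    if n == 0:
        return 0
    count = 0
    while n & 1 == 0:
        n >>= 1
        count += 1
    return count

def flajolet_martin(stream):
    """One-pass F0 estimate using a tiny amount of state: only the
    maximum number of trailing zeros R seen so far is stored."""
    R = 0
    for element in stream:
        h = int(hashlib.md5(str(element).encode()).hexdigest(), 16)
        R = max(R, trailing_zeros(h))
    return 2 ** R  # estimate of the number of distinct elements

# Stream with F0 = 6 distinct elements; prints 2^R for one hash function.
print(flajolet_martin(["a", "b", "c", "a", "b", "d", "e", "f", "a"]))
```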

Estimating Moments
 Alon-Matias-Szegedy (AMS) Algorithm
o Example stream: a, b, c, b, d, a, c, d, a, b, d, c, a, a, b
o n = 15; counts: a (x5), b (x4), c (x3), d (x3); 2nd moment F2 = 25 + 16 + 9 + 9 = 59
o Each variable X yields the estimate n * (2 * X.value - 1)
o Pick random positions X1 (3rd), X2 (8th), X3 (13th)
o X1.element = "c", X1.value = 3 (# of "c" from the 3rd position onward)
o Estimate for X1 = 15 * (2*3 - 1) = 75
o Estimate for X2 = 15 * (2*2 - 1) = 45 (# of "d" from the 8th position onward = 2)
o Estimate for X3 = 45 (# of "a" from the 13th position onward = 2)
o Average of X1, X2, X3 = 165/3 = 55 (close to the true value 59)
 In the case of infinite streams:
o We store one variable per randomly chosen position, so the challenge is not choosing n.
o The challenge is selecting the positions when the stream length is unknown in advance.
 Strategy for position selection, assuming we have space to store s variables and have seen n elements (a code sketch follows this list):
o The first s positions are chosen.
o When the (n+1)th token arrives, it is selected with probability s/(n+1).
o If it is selected, one of the existing s variables is discarded uniformly at random and a new variable is created for the arriving element, with value 1.
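A minimal Python sketch combining the AMS estimate with the reservoir-style position selection just described (the function name and the choice of s are illustrative; the output is randomized).

```python
import random

def ams_second_moment(stream, s):
    """One-pass AMS estimate of the second moment F2 using s variables.

    Each variable remembers an element and counts its occurrences from
    the variable's start position onward; positions are chosen with
    reservoir sampling since the stream length is unknown in advance.
    """
    variables = []  # list of [element, count] pairs
    n = 0
    for item in stream:
        n += 1
        for var in variables:           # item occurs again: bump counts
            if var[0] == item:
                var[1] += 1
        if len(variables) < s:          # first s positions are chosen
            variables.append([item, 1])
        elif random.randrange(n) < s:   # select position with prob. s/n
            variables[random.randrange(s)] = [item, 1]  # evict at random
    # Each variable contributes the estimate n * (2 * count - 1).
    return sum(n * (2 * v[1] - 1) for v in variables) / len(variables)

stream = list("abcbdacdabdcaab")        # the 15-element example stream
print(ams_second_moment(stream, s=3))   # randomized; the true F2 is 59
```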

Counting Ones in a Window


 Let’s suppose a window of length N on a binary stream. We want at all times to be able to answer queries of the form “how many 1’s are there in the last k bits?” for any k ≤ N. For this purpose we use the DGIM algorithm.
 The basic version of the algorithm uses O(log² N) bits to represent a window of N bits, and allows us to estimate the number of 1’s in the window with an error of no more than 50%.
 To begin, each bit of the stream has a timestamp, the position in which it arrives. The first
bit has timestamp 1, the second has timestamp 2, and so on.
 Since we only need to distinguish positions within the window of length N, we shall
represent timestamps modulo N, so they can be represented by log2 N bits.
 If we also store the total number of bits ever seen in the stream (i.e., the most recent
timestamp) modulo N, then we can determine from a timestamp modulo N where in the
current window the bit with that timestamp is.
 We divide the window into buckets, each consisting of:
o The timestamp of its right (most recent) end.
o The number of 1’s in the bucket. This number must be a power of 2, and we refer to the number of 1’s as the size of the bucket.
 To represent a bucket, we need log2 N bits to represent the timestamp (modulo N) of its
right end.
 To represent the number of 1’s we only need log2 log2 N bits. The reason is that we know this number i is a power of 2, say 2^j, so we can represent i by coding j in binary.
 Since j is at most log2 N, it requires log2 log2 N bits. Thus, O(log N) bits suffice to represent a bucket.

Figure 6 - A bit stream divided into buckets following the DGIM rules

 There are six rules that must be followed when representing a stream by buckets.
1. The right end of a bucket is always a position with a 1.
2. Every position with a 1 is in some bucket.
3. No position is in more than one bucket.
4. There are one or two buckets of any given size, up to some maximum size.
5. All sizes must be a power of 2.
6. Buckets cannot decrease in size as we move to the left (back in time).
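A simplified Python sketch of DGIM bucket maintenance and querying under these rules. For readability it stores absolute timestamps instead of timestamps modulo N; bucket sizes are powers of 2 and at most two buckets share a size.

```python
def dgim_update(buckets, t, bit, N):
    """Process one bit at time t; `buckets` holds (timestamp, size)
    pairs, newest first, for a window of the last N bits."""
    # Drop the oldest bucket once its right end leaves the window.
    if buckets and buckets[-1][0] <= t - N:
        buckets.pop()
    if bit == 1:
        buckets.insert(0, (t, 1))
        size = 1
        # Whenever three buckets share a size, merge the two oldest of
        # them into one bucket of twice the size (restoring rule 4).
        while [b[1] for b in buckets].count(size) == 3:
            idxs = [i for i, b in enumerate(buckets) if b[1] == size]
            i, j = idxs[1], idxs[2]             # the two oldest
            merged = (buckets[i][0], size * 2)  # keep newer right end
            del buckets[j]
            buckets[i] = merged
            size *= 2

def dgim_count(buckets, t, k):
    """Estimate the number of 1s in the last k bits: all qualifying
    buckets count fully except the oldest, which counts half."""
    total, oldest = 0, 0
    for ts, size in buckets:
        if ts > t - k:
            total += size
            oldest = size
    return total - oldest // 2

buckets, N = [], 16
stream = [1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 0, 1, 1, 0, 1, 1]
for t, bit in enumerate(stream, start=1):
    dgim_update(buckets, t, bit, N)
print(dgim_count(buckets, len(stream), k=8))  # prints 5; true count is 5
```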

Decaying Window
 This algorithm allows you to identify the most popular elements (trending, in other words)
in an incoming data stream.
 The decaying window algorithm not only tracks the most recurring elements in an incoming
data stream, but also discounts any random spikes or spam requests that might have
boosted an element’s frequency.
 In a decaying window, you assign a score or weight to every element of the incoming data
stream. Further, you need to calculate the aggregate sum for each distinct element by
adding all the weights assigned to that element.
 The element with the highest total score is listed as trending or the most popular.
1. Assign each element a weight/score.
2. Calculate the aggregate sum for each distinct element by adding all the weights assigned to that element.
 In a decaying window algorithm, you assign more weight to newer elements.
 When a new element arrives, you first reduce the weight of all the existing elements by a constant factor (1 - c) and then assign the new element its own weight.
 The aggregate sum of the decaying exponential weights can be calculated using the following formula:

$\sum_{i=0}^{t-1} a_{t-i} (1 - c)^i$

 Here, c is usually a small constant, on the order of 10^-6 or 10^-9.


 Whenever a new element, say a_{t+1}, arrives in the data stream you perform the following steps to obtain an updated sum:
o Multiply the current sum/score by the value (1 - c).
o Add the weight corresponding to the new element.

Figure 7 - Weight decays exponentially over time

 In a data stream consisting of various elements, you maintain a separate sum for each distinct element.
 For every incoming element, you multiply the sums of all the existing elements by (1 - c); then you add the weight of the incoming element to its corresponding aggregate sum.
 A threshold can be maintained, and elements whose weight falls below it can be ignored.
 Finally, the element with the highest aggregate score is listed as the most popular element.

Example
 For example, consider the sequence of Twitter tags below:
o fifa, ipl, fifa, ipl, ipl, ipl, fifa
 Let each element in the sequence have a weight of 1, and let c be 0.1.
 The aggregate sum of each tag at the end of the above stream is calculated as below; at every arrival the running sum is multiplied by (1-0.1) and the weight of the arriving tag (1 or 0) is added:
 fifa:
o fifa - 0 * (1-0.1) + 1 = 1
o ipl - 1 * (1-0.1) + 0 = 0.9 (adding 0 because the current tag is not fifa)
o fifa - 0.9 * (1-0.1) + 1 = 1.81 (adding 1 because the current tag is fifa)
o ipl - 1.81 * (1-0.1) + 0 = 1.629
o ipl - 1.629 * (1-0.1) + 0 = 1.4661
o ipl - 1.4661 * (1-0.1) + 0 = 1.3195
o fifa - 1.3195 * (1-0.1) + 1 = 2.1875
 ipl:
o fifa - 0 * (1-0.1) + 0 = 0
o ipl - 0 * (1-0.1) + 1 = 1
o fifa - 1 * (1-0.1) + 0 = 0.9 (adding 0 because the current tag is not ipl)
o ipl - 0.9 * (1-0.1) + 1 = 1.81
o ipl - 1.81 * (1-0.1) + 1 = 2.629
o ipl - 2.629 * (1-0.1) + 1 = 3.3661
o fifa - 3.3661 * (1-0.1) + 0 = 3.0295
 At the end of the sequence, the score of fifa is 2.1875 while the score of ipl is 3.0295.
 So, ipl is more trending than fifa.
 ipl wins both because it occurs more often (four times versus three) and because its occurrences are more recent; recency is exactly what the decaying window rewards.
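A minimal Python sketch of the decaying window update that reproduces the numbers above (the decay constant c = 0.1 and unit weights follow the example; in practice you would also drop sums that fall below a threshold).

```python
def decaying_scores(stream, c=0.1):
    """Maintain one exponentially decayed score per distinct element:
    on each arrival every sum is multiplied by (1 - c), then the
    arriving element's sum is incremented by its weight (1 here)."""
    scores = {}
    for tag in stream:
        for key in scores:                     # decay all existing sums
            scores[key] *= (1 - c)
        scores[tag] = scores.get(tag, 0) + 1   # add the new weight
    return scores

tags = ["fifa", "ipl", "fifa", "ipl", "ipl", "ipl", "fifa"]
print(decaying_scores(tags))
# {'fifa': 2.1875..., 'ipl': 3.0294...} -> ipl is the trending tag
```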

Advantages of Decaying Window Algorithm:
 Sudden spikes or spam data are discounted.
 Newer elements are given more weight, which produces the right trending output.

Real Time Analytics Platform Applications


 Fraud detection system for online transactions
 Log analysis for understanding usage pattern
 Click analysis for online recommendations
 Social media analytics
 Push notifications to customers for location-based advertisements in retail
 Action for emergency services such as fires and accidents in an industrial plant
 Immediate reaction to abnormal measurements in healthcare monitoring

RTAP Applications

1. Apache Samza
 Apache Samza is an open source, near-real time, asynchronous computational framework
for stream processing developed by the Apache Software Foundation in Scala and Java.
 Samza allows users to build stateful applications that process data in real-time from multiple
sources including Apache Kafka.
 Samza provides fault tolerance, isolation and stateful processing. Samza is used by multiple companies; the biggest installation is at LinkedIn.

2. Apache Flink
 Apache Flink is an open-source, unified stream-processing and batch-processing framework
developed by the Apache Software Foundation.
 The core of Apache Flink is a distributed streaming data-flow engine written in Java and
Scala.
 Flink provides a high-throughput, low-latency streaming engine as well as support for event-time processing and state management.
 Flink does not provide its own data-storage system, but provides data-source and sink connectors to systems such as Amazon Kinesis, Apache Kafka, Alluxio, HDFS, Apache Cassandra, and Elasticsearch.

Real Time Sentiment Analysis


 Sentiment analysis is a text analysis technique that uses natural language processing (NLP), machine learning, and other data analysis techniques to read, analyze and derive objective quantitative results from raw text.
 Texts are classified as positive, negative, neutral, or anywhere in between.
 It can read all manner of text for opinion and emotion, helping to understand the thoughts and feelings of the writer.

 Sentiment analysis is also known as opinion mining.

 Real-time sentiment analysis is an AI-powered solution to track mentions of your brand and products, wherever they may appear, and automatically analyze them with almost no human input needed. A toy sketch of the scoring step follows the figure below.

Figure 8 - Sentiment Analysis Process
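As an illustration only (the word lists and mentions below are made up; a production system would use trained NLP models as described above), here is a minimal lexicon-based scorer applied to a stream of brand mentions.

```python
# Tiny hand-made sentiment lexicons; purely illustrative.
POSITIVE = {"love", "great", "excellent", "good", "amazing"}
NEGATIVE = {"hate", "terrible", "bad", "awful", "broken"}

def sentiment(text):
    """Classify one mention as positive, negative, or neutral by
    counting lexicon hits."""
    words = text.lower().split()
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"

# Simulate a real-time stream of mentions and tally the results.
mentions = [
    "I love the new phone and the camera is great",
    "Battery life is terrible and the screen arrived broken",
    "Just bought it today",
]
counts = {"positive": 0, "negative": 0, "neutral": 0}
for mention in mentions:
    counts[sentiment(mention)] += 1
print(counts)  # {'positive': 1, 'negative': 1, 'neutral': 1}
```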

 Benefits of performing real time sentiment analysis:


o Marketing campaign success analysis
o Prevention of business disasters
o Instant product feedback
o Stock market predictions
 Basic components of an opinion
o Opinion Holder
o Object
o Opinion
 Opinion Mining Tasks
o At Documents or review Level
 Task: Sentiment classification of reviews
 Classes: Positive, Negative and Neutral
o At the Sentence Level
 Task-1: Identifying subjective/opinionated sentences
 Classes: Objective and subjective
 Task-2: Sentiment classification of sentences
 Classes: Positive, Negative, and Neutral
o At feature level
 Task-1: Identify and extract object features that have been commented
on by an opinion holder – e.g. reviewer
 Task-2: Determine whether the opinions on the features are positive,
negative or neutral

 Task-3: Group feature synonyms - this produces a feature-based opinion summary of multiple reviews.
 Two types of opinions:
o Regular Opinions
 Sentiment/opinion expression on some target entities, e.g. products,
events, topics, persons.
 Direct Opinion and Indirect Opinion
o Comparative Opinion
 Comparisons of more than one entity.
 Example: a review stating that one phone's camera is better than another's.

Case Study – Stock Market Predictions


 The necessity of Data Mining
o Data
o Information
o Business Decisions
 Principle of the stock market data mining algorithm.
 Graph Analytics and Big Data.
 Graph analytics is an analytics alternative that uses an abstraction called a graph model.
 Graph analytics is an alternative to the traditional data warehouse model as a framework
for absorbing both structured and unstructured data.
 It employs the graph abstraction for representing connectivity, consisting of a collection of vertices (nodes) connected by edges.
 Graph analytics platforms aim to deliver predictable, interactive performance.

