
DATA ANALYTICS (UNIT-03)

Name: Varundeep Singh
Introduction to the Stream Concept
Data streams are continuous, unbounded, and high-speed flows of data
generated in real-time by various sources, such as sensors, social media,
network traffic, or transaction logs. Mining data streams refers to the process
of extracting meaningful patterns, insights, or knowledge from this ongoing
flow of data.
Key Characteristics of Data Streams
1. Continuous Flow: Data arrives in real-time and cannot be "paused" for analysis.
2. Unbounded: Unlike traditional datasets, the size of a data stream is theoretically infinite.
3. High-Speed: Data is generated and transmitted at high velocity, requiring quick processing.
Stream Data Model and Architecture
Stream Computing
1. Throughput Calculation
Problem: A stream processing system receives data at a rate of 5,000 events per second. If the system processes 25,000 events in 5 seconds, what is the throughput of the system in events per second?
Sol: Throughput is defined as the number of events processed per second:
Throughput = 25,000 / 5 = 5,000 events/second
Ans: 5,000 events/second (equal to the input rate).
2. Latency Estimation
Problem: A stream processing system has a total latency of 200 milliseconds to process one event. If the input rate is 1,000 events per second, how many events can the system process in one second without exceeding the processing capacity?
Sol: The system latency is 200 milliseconds (ms) per event. Convert this to seconds:
Latency per Event = 200 ms = 0.2 seconds
The number of events the system can process in one second is:
Events per Second = 1 / 0.2 = 5
Ans: The system can process 5 events/second based on its latency, far below the 1,000 events/second input rate.
3. Windowing Operations
Problem: A tumbling window of 5 seconds is applied to a data stream. If the input rate is 1,000 events per second, how many events are processed in each window?
Sol: The number of events processed in each window is:
Events per Window = Input Rate × Window Size = 1,000 × 5 = 5,000
Ans: Each window processes 5,000 events.
4. Sliding Window Operations
Problem: A sliding window of size 10 seconds with a slide interval of 2 seconds is applied to a data stream. If the input rate is 500 events per second, how many windows overlap at any given time?
Sol: The overlap factor is calculated as:
Overlap = Window Size / Slide Interval = 10 / 2 = 5
Ans: At any given time, 5 windows overlap.
5. Fault Tolerance and Checkpointing
Problem: A stream processing system performs checkpointing every 30 seconds. If the system crashes at 85 seconds, how many events are lost if the input rate is 2,000 events per second?
Sol: Checkpointing occurs every 30 seconds, so the last checkpoint before the crash was at 60 seconds. The time since the last checkpoint is:
Time since Last Checkpoint = 85 − 60 = 25 seconds
The number of events lost is:
Events Lost = 25 × 2,000 = 50,000
Ans: The system loses 50,000 events.
6. Resource Allocation
Problem: A stream processing task requires 2 GB of memory per 1,000 events/second. If the input rate is 15,000 events/second, how much memory is required?
Sol: The memory requirement scales linearly with the input rate:
Memory = (15,000 / 1,000) × 2 GB = 30 GB
Ans: The task requires 30 GB of memory.
7. Event-Time vs. Processing-Time Lag
Problem: Events in a stream have an average event-time delay of 2 seconds. If the system processes events with an average latency of 3 seconds, what is the total lag between event generation and processing?
Sol: The total lag is the sum of event-time delay and processing latency:
Total Lag = 2 + 3 = 5 seconds
Ans: The total lag is 5 seconds.
Sampling Data in a Stream
Sampling in data streams involves selecting a subset of data points from the
continuous flow for analysis or estimation.

1. Reservoir Sampling
Reservoir sampling is a randomized algorithm used to sample a fixed-size subset of k items from a stream of unknown size N. It ensures that each item in the stream has an equal probability of being included in the sample, even though the size of the stream might not be known in advance.

Problem: A data stream has 1,000,000 events. You want to select a random
sample of 100 events using reservoir sampling. At the arrival of the 10,000th
event, what is the probability that this event is included in the reservoir?
• Sol:
In reservoir sampling, the probability that the i-th element is in a sample of size k at any point is:
P(selected) = k / i
For the 10,000th event:
P(selected) = 100 / 10,000 = 0.01
Answer: The probability that the 10,000th event is included in the sample is 1%.
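A minimal Python sketch of reservoir sampling (function and variable names are illustrative):

```python
import random

def reservoir_sample(stream, k):
    """Keep a uniform random sample of k items from a stream of unknown length."""
    reservoir = []
    for i, item in enumerate(stream, start=1):
        if i <= k:
            reservoir.append(item)       # fill the reservoir with the first k items
        else:
            j = random.randint(1, i)     # choose a slot uniformly from 1..i
            if j <= k:
                reservoir[j - 1] = item  # item i replaces a member with probability k/i
    return reservoir

sample = reservoir_sample(range(1_000_000), 100)
print(len(sample))  # 100
```

After the whole stream is seen, each item remains in the sample with probability k/N, matching the k/i formula above.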
2. Sliding Window Sampling
• Sliding Window Sampling is a technique used in stream processing to sample
data points from a continuous data stream within a "sliding window" of recent
events. The "window" moves continuously as new data arrives, retaining only
the most recent data while discarding older data.
Problem: A sliding window of size 1,000 events is used, and every 10th event is
sampled from the window. If the input rate is 2,000 events per second, how
many samples are collected per second?
• Solution:
1.Number of events in a second: 2,000 events
2.Sampling rate: Every 10th event is sampled.
Samples per Second = 2000/10 = 200 samples/second
Answer: The system collects 200 samples per second.

3. Stratified Sampling
• Stratified Sampling is a statistical technique used to ensure that a sample
accurately represents the underlying population by dividing the population
into distinct groups, or "strata," and then sampling proportionally from each
group. This method reduces sampling bias and improves the precision of
estimates compared to simple random sampling.

Problem: A stream has two strata:


Stratum A: Accounts for 30% of the stream.
Stratum B: Accounts for 70% of the stream. You want to collect 1,000 samples
from the stream using stratified sampling. How many samples should be taken
from each stratum?
• Sol:- In stratified sampling, the number of samples from each stratum is
proportional to its size.
Samples from Stratum A=Total Samples × Proportion of Stratum
Samples from Stratum A = 1,000×0.3 = 300
Samples from Stratum B=1,000×0.7 = 700
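The proportional allocation above can be sketched in Python (the function name and stratum labels are illustrative):

```python
def stratified_counts(total_samples, proportions):
    """Allocate a sample budget across strata in proportion to their share of the stream."""
    return {name: round(total_samples * p) for name, p in proportions.items()}

allocation = stratified_counts(1000, {"A": 0.3, "B": 0.7})
print(allocation)  # {'A': 300, 'B': 700}
```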

4. Sampling for Estimation


• Problem: A data stream has a total of 1,000,000 events, and you take a
random sample of 10,000 events to estimate the mean value of an attribute. If
the sample mean is 50 and the population standard deviation is 10, calculate
the 95% confidence interval for the population mean.
• Solution: The confidence interval for the mean is given by:
CI = x̄ ± z · (σ / √n)
Where:
• x̄ = 50 (sample mean),
• σ = 10 (population standard deviation),
• n = 10,000 (sample size),
• z = 1.96 (for 95% confidence level).
CI = 50 ± 1.96 × (10 / 100) = 50 ± 0.196
Answer: The 95% confidence interval is (49.804, 50.196).
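The interval can be checked with a few lines of Python (the helper name is illustrative):

```python
import math

def confidence_interval(sample_mean, sigma, n, z=1.96):
    """Confidence interval for the population mean: x-bar ± z·σ/√n."""
    margin = z * sigma / math.sqrt(n)
    return (sample_mean - margin, sample_mean + margin)

low, high = confidence_interval(50, 10, 10_000)
print(round(low, 3), round(high, 3))  # 49.804 50.196
```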


5. Bernoulli Sampling
• Bernoulli Sampling is a probabilistic sampling technique where each item in a dataset or data stream is independently selected with a fixed probability p. It is named after Jacob Bernoulli, reflecting its foundation in Bernoulli trials, where each event has only two possible outcomes: "selected" or "not selected."

• Problem: A data stream generates 10,000 events per second, and you apply
Bernoulli sampling with a probability of p=0.05. How many events are
expected to be sampled per second?
• Solution: The expected number of sampled events is:
Expected Samples per Second=p × Total Events per Second
Expected Samples per Second=0.05×10,000=500
Answer: 500 events are expected to be sampled per second.
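A Bernoulli-sampling sketch in Python; the seed is an arbitrary choice for reproducibility, and the realized count only approximates the expected 500:

```python
import random

def bernoulli_sample(stream, p, seed=42):
    """Independently keep each event with fixed probability p."""
    rng = random.Random(seed)    # seeded so the run is reproducible
    return [x for x in stream if rng.random() < p]

sampled = bernoulli_sample(range(10_000), 0.05)
print(len(sampled))              # close to the expected 0.05 × 10,000 = 500
```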
Filtering a Stream
Filtering a stream refers to the process of extracting or processing specific
pieces of data from a continuous flow of information (referred to as a
"stream"). This concept is commonly used in data processing, programming,
and systems that handle real-time data, such as sensor readings, log files, or
live social media feeds.

(code in Jupyter)
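A minimal Python sketch of stream filtering; the readings and the threshold of 100 are hypothetical:

```python
def filter_stream(stream, predicate):
    """Lazily yield only the events that satisfy the predicate."""
    for event in stream:
        if predicate(event):
            yield event

# Hypothetical sensor readings; keep only values above a threshold of 100.
readings = [95, 120, 87, 150, 101, 99]
kept = list(filter_stream(readings, lambda r: r > 100))
print(kept)  # [120, 150, 101]
```

Because the function is a generator, it can consume an unbounded stream without buffering it.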
Counting Distinct Elements in a Stream

We'll look at two approaches:
1. Exact counting using a hash set
2. Approximate counting using the Flajolet–Martin algorithm
Exact Counting Using a Hash Set


• When counting distinct elements in a stream exactly, a hash set is a
perfect choice because it inherently stores only unique elements.
Here's a step-by-step explanation and example implementation:
• Algorithm:
1. Initialize an empty set (distinct_set)
2. Iterate through each element in the stream.
3. Add the element to the set (if it's already present, the set ignores it).
4. After processing the stream, the size of the set represents the number of
distinct elements.
Example
Stream: [5, 1, 2, 3, 5, 2, 1, 6, 7]

Final set:{5,1,2,3,6,7}
Count of distinct elements: 6
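The four steps above reduce to a few lines of Python:

```python
def count_distinct(stream):
    """Exact distinct count: a set stores each element at most once."""
    distinct = set()
    for element in stream:
        distinct.add(element)  # duplicates are silently ignored by the set
    return len(distinct)

stream = [5, 1, 2, 3, 5, 2, 1, 6, 7]
print(count_distinct(stream))  # 6
```

The cost is O(n) memory in the number of distinct elements, which motivates the approximate method next.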
• Approximate Counting with Flajolet-Martin Algorithm
The Flajolet-Martin (FM) algorithm is a probabilistic method for estimating
the number of distinct elements in a stream. It is particularly memory-
efficient and is widely used for large-scale data streaming scenarios.
• Stream : [5, 1, 2, 3, 5, 2, 1, 6, 7]
• Walkthrough:
1. Hash each element with a hash function h(x). For explanation purposes, we simulate a simple one:
h(x) = (x × 31) mod 256
This function multiplies the input number by 31 and then takes the remainder when divided by 256. The output always lies in the range [0, 255], which fits within an 8-bit binary number. The resulting hash values are:
5 → 155 (binary: 10011011)
1 → 31 (binary: 00011111)
2 → 62 (binary: 00111110)
3 → 93 (binary: 01011101)
6 → 186 (binary: 10111010)
7 → 217 (binary: 11011001)
2. Find the position of the rightmost 1-bit for each hashed value:
155 → Position: 0
31 → Position: 0
62 → Position: 1
93 → Position: 0
186 → Position: 1
217 → Position: 0
3. Set BITMAP[position] = 1 for each value. Positions 0 and 1 are set, so the leftmost unset position is R = 2, giving the estimate 2^R / φ ≈ 4 / 0.77351 ≈ 5.2, close to the true count of 6.
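The walkthrough can be sketched in Python using the same toy hash h(x) = (x × 31) mod 256; the constant φ ≈ 0.77351 is the standard Flajolet–Martin bias-correction factor:

```python
def h(x):
    return (x * 31) % 256                # toy hash from the walkthrough

def rightmost_one(y, width=8):
    """Position of the least significant 1-bit; returns width when y == 0."""
    if y == 0:
        return width
    pos = 0
    while y & 1 == 0:
        y >>= 1
        pos += 1
    return pos

def fm_estimate(stream, width=8):
    """Flajolet–Martin estimate of the number of distinct elements."""
    bitmap = [0] * width
    for x in stream:                     # duplicates hash to the same bit
        bitmap[rightmost_one(h(x), width)] = 1
    R = bitmap.index(0)                  # leftmost position still unset
    return (2 ** R) / 0.77351            # bias-corrected estimate

stream = [5, 1, 2, 3, 5, 2, 1, 6, 7]
print(round(fm_estimate(stream), 2))     # 5.17, close to the true count of 6
```

A real deployment would average over many independent hash functions; a single toy hash is only for illustration.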
Counting Ones in a Window in a Data Stream
Problem
• Given a binary data stream S = [1, 0, 1, 1, 0, 1, 0, 1], count the number of 1's in a sliding window of size W = 4.
• Step-by-Step Solution
We maintain:
1.A queue to represent the sliding window.
2.A counter to track the number of 1’s in the current window
• Algorithm
1. Initialize an empty queue and set the count of 1's to zero.
2. For each incoming data point, add the new element to the queue.
3. Increment the count if the new element is 1.
4. If the window size exceeds W, remove the oldest element from the queue; decrement the count if the removed element is 1.
5. The count at any point represents the number of 1's in the current window.
Execution Walkthrough

Initial State:
•Window: []
•Count: 0

Process Each Element:


1.Add 1:
•Window: [1]
•Count: 1
2.Add 0:
•Window: [1, 0]
•Count: 1
3.Add 1:
•Window: [1, 0, 1]
•Count: 2
4.Add 1:
•Window: [1, 0, 1, 1]
•Count: 3
5.Add 0 (Exceeds window size, remove oldest):
•Window: [0, 1, 1, 0]
•Count: 2
6.Add 1 (exceeds window size, remove oldest):
•Window: [1, 1, 0, 1]
•Count: 3
7.Add 0 (exceeds window size, remove oldest):
•Window: [1, 0, 1, 0]
•Count: 2
8.Add 1 (exceeds window size, remove oldest):
•Window: [0, 1, 0, 1]
•Count: 2

Stream: [1, 0, 1, 1, 0, 1, 0, 1]
Window size: 4
After processing element 1 (1): Count of 1s = 1
After processing element 2 (0): Count of 1s = 1
After processing element 3 (1): Count of 1s = 2
After processing element 4 (1): Count of 1s = 3
After processing element 5 (0): Count of 1s = 2
After processing element 6 (1): Count of 1s = 3
After processing element 7 (0): Count of 1s = 2
After processing element 8 (1): Count of 1s = 2
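The queue-and-counter algorithm above, written as a Python generator that yields the output listed:

```python
from collections import deque

def ones_in_window(stream, w):
    """Yield the count of 1's in the last w elements after each arrival."""
    window, count = deque(), 0
    for bit in stream:
        window.append(bit)
        count += bit
        if len(window) > w:
            count -= window.popleft()  # evict the oldest element, adjusting the count
        yield count

counts = list(ones_in_window([1, 0, 1, 1, 0, 1, 0, 1], 4))
print(counts)  # [1, 1, 2, 3, 2, 3, 2, 2]
```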
• Other approaches include the following:
Decaying window
A decaying window is an alternative to a fixed-size sliding window where
recent elements in the data stream are given more weight, and the
influence of older elements gradually "decays" over time. This is especially
useful in scenarios where you want to emphasize recent trends without
strictly limiting the window size.
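One common realization is an exponentially decaying sum, where each step multiplies the running total by (1 − λ) before adding the new element; the decay rate λ = 0.1 below is an arbitrary illustration:

```python
def decaying_sum(stream, decay=0.1):
    """Exponentially decayed running sum: older elements fade by (1 - decay) per step."""
    total = 0.0
    history = []
    for x in stream:
        total = total * (1 - decay) + x  # old contributions shrink; the new one enters at full weight
        history.append(total)
    return history

sums = decaying_sum([1, 0, 1, 1, 0, 1, 0, 1])
print(round(sums[-1], 3))  # 3.535
```

Unlike a fixed window, no element is ever dropped outright; its influence just shrinks geometrically.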
Real-Time Analytics Platform (RTAP) Applications
A Real-Time Analytics Platform (RTAP) processes and analyzes data as it is
generated, enabling businesses and systems to act immediately on insights. RTAPs
are increasingly critical in industries where rapid decision-making is crucial. Here are
the main applications across various sectors:

1. E-Commerce and Retail
• Dynamic Pricing: Adjust prices in real-time based on demand, competitor prices,
and inventory levels.
• Personalized Recommendations: Suggest products based on real-time browsing
behavior, purchase history, and trends.
• Fraud Detection: Identify unusual transaction patterns or account behavior
instantly.
• Inventory Management: Monitor stock levels and predict restocking needs based
on live sales data.
2. Financial Services
• High-Frequency Trading: Analyze market data streams to execute trades within
milliseconds.
• Fraud Prevention: Detect suspicious transactions or fraudulent activities in real time.
• Risk Management: Continuously monitor portfolios and market conditions to manage
risks dynamically.
• Customer Analytics: Provide personalized banking services and investment
recommendations.

3. Healthcare
• Remote Patient Monitoring: Analyze real-time data from wearable devices to monitor
vital signs and detect anomalies.
• Predictive Diagnostics: Identify health risks by analyzing data from connected devices and
electronic health records (EHR).
4. Telecommunications
• Network Monitoring and Optimization: Identify and resolve network issues
proactively by analyzing live traffic data.
• Customer Experience Management: Personalize services and detect churn
risks through real-time usage analysis.
• Fraud Detection: Identify irregularities like SIM box fraud or unauthorized
account access.

5. IoT and Smart Cities
• Traffic Management: Monitor and control traffic flow based on data from
sensors, cameras, and GPS devices.
• Energy Management: Optimize power grid operations by analyzing live
energy consumption and production data.
• Public Safety: Detect and respond to emergencies like accidents or crimes
using data from IoT devices and cameras.
The Flajolet–Martin Algorithm
Assume our multiset is M, with elements e1, …, en.
Suppose we are given a function hash(e) which maps elements to integers uniformly distributed over the range 0 to 2^L − 1, where "L" is the number of bits required to represent the range of integers produced by hash(e).
We also define a function bit(y, k) which returns the k-th bit of the binary number y, such that:
bit(y, k) = ⌊y / 2^k⌋ mod 2
Then we define a function p(y) which returns the position of the least significant 1-bit of y, defined formally as:
p(y) = min{ k ≥ 0 : bit(y, k) = 1 }
For convenience we will assume p(0) = L.
For each element e, the algorithm sets BITMAP[p(hash(e))] = 1 in a bitmap of L bits that starts out all zero. For example, with L = 4:
INITIAL BITMAP = [0, 0, 0, 0]
After processing the stream: BITMAP = [1, 0, 1, 1]
If R is the index of the leftmost 0 in BITMAP, the number of distinct elements is estimated as 2^R / φ, where φ ≈ 0.77351.
Filtering Streams in Data Analytics
1. Rule-Based Filtering: Applies predefined conditions or thresholds to a data stream, retaining only the relevant data and discarding the rest. The rules are typically based on domain knowledge or specific requirements.
2. Window-Based Filtering: Processes a continuous data stream within a time-based or event-based window to filter, aggregate, or summarize data.
Window Types:
• Sliding Window: Overlapping windows that continuously move over the data.
• Tumbling Window: Non-overlapping, fixed-size windows.
• Session Window: Defined by a period of inactivity in the stream.
Example 1: Time-Based Sliding Window
Scenario:
• Stream=[100,102,101,103,104,105] Use a sliding window of size 3 to calculate the average price in each
window.
Example 2: Event-Based Tumbling Window
• Scenario:
A sensor produces a stream of readings:
• Stream=[5,8,7,6,10,9,8] Aggregate every 3 readings into a single sum.
Example 3: Session Window
Scenario:
Website user clicks with timestamps (in seconds):
Stream = [5, 10, 15, 35, 40, 45, 90]. Define a session window by a gap of less than 20 seconds, grouping clicks whose intervals are < 20 seconds apart:
Sessions = {[5, 10, 15], [35, 40, 45], [90]}
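The three window examples above can be computed with a short Python sketch (plain lists stand in for the stream, and the helper names are illustrative):

```python
def sliding_avg(stream, size):
    """Average over each overlapping window of `size` consecutive values."""
    return [sum(stream[i:i + size]) / size for i in range(len(stream) - size + 1)]

def tumbling_sum(stream, size):
    """Sum over non-overlapping, fixed-size windows."""
    return [sum(stream[i:i + size]) for i in range(0, len(stream), size)]

def session_windows(timestamps, gap):
    """Start a new session whenever the gap to the previous event reaches `gap`."""
    sessions = [[timestamps[0]]]
    for t in timestamps[1:]:
        if t - sessions[-1][-1] < gap:
            sessions[-1].append(t)   # still inside the current session
        else:
            sessions.append([t])     # inactivity gap: open a new session
    return sessions

print(sliding_avg([100, 102, 101, 103, 104, 105], 3))    # [101.0, 102.0, 102.66..., 104.0]
print(tumbling_sum([5, 8, 7, 6, 10, 9, 8], 3))           # [20, 25, 8]
print(session_windows([5, 10, 15, 35, 40, 45, 90], 20))  # [[5, 10, 15], [35, 40, 45], [90]]
```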
3. Outlier Detection
Filters anomalous data using statistical methods.
Example:
Stream of hourly sales:
Stream=[200,205,210,800,215,220]
Identify and remove outliers using the Z-score method.
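A Z-score filter for the sales stream might look like this; the threshold of 2 is an assumption (the common cutoff of 3 would not flag 800 here, because the outlier itself inflates the mean and standard deviation):

```python
import statistics

def remove_outliers(stream, z_thresh=2.0):
    """Drop values whose Z-score magnitude exceeds z_thresh (threshold is illustrative)."""
    mean = statistics.mean(stream)
    std = statistics.pstdev(stream)  # population standard deviation
    return [x for x in stream if abs((x - mean) / std) <= z_thresh]

sales = [200, 205, 210, 800, 215, 220]
print(remove_outliers(sales))  # [200, 205, 210, 215, 220]
```

In a true streaming setting the mean and standard deviation would be maintained incrementally rather than recomputed over a batch.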
4. Aggregation-Based Filtering in Data Analytics
Aggregation-based filtering involves summarizing a data stream by applying aggregation operations (e.g., sum, average, maximum, minimum) over a subset of the data (such as a window or group). This helps filter the stream by retaining only the aggregated summaries or by applying thresholds to these aggregated results.
Example 3: Social Media Monitoring
A stream of tweet counts is recorded every hour:
Stream = [100, 150, 200, 180, 250, 300, 400, 350]. Retain only hours where the cumulative tweet count exceeds 500.
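The cumulative-threshold filter can be sketched in Python; cumulative sums run 100, 250, 450, 630, …, so the condition first holds at hour 4:

```python
from itertools import accumulate

def hours_over_threshold(counts, threshold):
    """Return (hour, running_total) pairs once the cumulative count exceeds threshold."""
    running = accumulate(counts)  # cumulative sums: 100, 250, 450, 630, ...
    return [(hour, total) for hour, total in enumerate(running, start=1) if total > threshold]

tweets = [100, 150, 200, 180, 250, 300, 400, 350]
print(hours_over_threshold(tweets, 500))  # [(4, 630), (5, 880), (6, 1180), (7, 1580), (8, 1930)]
```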
