
MODULE 4 BDA

Q1. Discuss Data Streams and how Big Data deals with Real-Time Data Analytics.

->A data stream is a continuous, never-ending flow of data, usually coming in real-time from sources like
sensors, applications, or online systems.

Unlike traditional data (which is stored and then processed), data streams are processed on the fly — meaning
as soon as data comes in, we try to analyze it immediately.

Examples of Data Streams:

• Twitter Feed: Tweets coming in every second.

• Stock Market Data: Live updates of stock prices.

• Smart Home Devices: Sensors sending data continuously (temperature, motion, etc.).

• Website Clicks: Logging user clicks in real-time for personalization.

What is Real-Time Analytics in Big Data?

Real-Time Analytics means collecting, processing, and analyzing data immediately as it comes in — with
almost zero delay.

It’s like:

• Watching a cricket match live vs watching a recording.

• You need to respond to what is happening right now.


Key idea: Don’t wait to store and then analyze. Analyze as it comes!

How Does Big Data Deal with Real-Time Analytics?

Let’s understand the steps + tools used by Big Data in real-time systems:

A. Data Ingestion (Capturing Real-Time Data)

This is the step where the system receives continuous data.

Tool: Apache Kafka

• Kafka is a message broker.

• It collects data from multiple sources (like mobile apps, sensors, or websites).

• Sends the data to real-time processors.

Example: User clicks on Flipkart → Kafka collects it instantly.
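
As a rough illustration, here is a minimal ingestion sketch using the kafka-python client (the broker address localhost:9092 and the topic name "clicks" are assumptions for this sketch, not part of the notes above):

import json
from kafka import KafkaProducer   # kafka-python client, assumed to be installed

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda event: json.dumps(event).encode("utf-8"),
)

# Each user click is pushed to Kafka the moment it happens.
click_event = {"user_id": 42, "product": "phone", "action": "click"}
producer.send("clicks", click_event)
producer.flush()   # make sure the event actually leaves the local buffer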

B. Stream Processing (Processing the Data Quickly)

Once data is captured, it must be analyzed on the fly.

Tools Used: Apache Spark Streaming, Apache Storm, Apache Flink

Example:

• Spark checks all tweets in the last 10 seconds.

• It finds the top trending hashtags instantly.
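
A minimal sketch of such a job using PySpark Structured Streaming (assumptions: tweets arrive as plain text on a Kafka topic named "tweets", the spark-sql-kafka connector is available, and a 10-second tumbling window is used as in the example):

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, explode, split, window

spark = SparkSession.builder.appName("TrendingHashtags").getOrCreate()

tweets = (spark.readStream.format("kafka")
          .option("kafka.bootstrap.servers", "localhost:9092")
          .option("subscribe", "tweets")
          .load()
          .selectExpr("CAST(value AS STRING) AS text", "timestamp"))

# Split each tweet into words and keep only the hashtags.
hashtags = (tweets
            .select(explode(split(col("text"), " ")).alias("word"), "timestamp")
            .filter(col("word").startswith("#")))

# Count each hashtag inside a 10-second tumbling window and rank them.
trending = (hashtags
            .groupBy(window(col("timestamp"), "10 seconds"), col("word"))
            .count()
            .orderBy(col("count").desc()))

# Continuously print the current ranking to the console.
query = trending.writeStream.outputMode("complete").format("console").start()
query.awaitTermination()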

C. Storage (Optional for future reference)

Not all data is stored, but important summaries or outputs might be saved.

Tools: HBase, Cassandra, Amazon S3

Example: Save only the final dashboard data, not every click.

D. Visualization / Dashboard (Real-Time Reports)


Tools: Apache Superset, Grafana, Power BI (streaming)

Example: Flipkart admin sees in real-time:

• How many users are online

• Which products are selling the most

Q2. Explain DSMS Architecture in Detail.


What is DSMS?

A Data Stream Management System (DSMS) is a system designed to handle continuous, real-time data
streams, as opposed to the traditional Database Management System (DBMS) which manages persistent data
stored in files or databases.

Unlike DBMS, where queries are executed once on stored data, DSMS continuously executes queries on
incoming data streams.

Why DSMS?

• To handle high-velocity data.

• To perform real-time analytics and monitoring.

• Useful in domains like IoT, finance, telecommunications, social media, and sensor networks.
DSMS Architecture – Detailed Component-Wise Explanation

1. Streaming Inputs (Data Sources)

Theory:

Data in DSMS comes from dynamic, time-varying sources such as:

• Sensors (e.g., temperature, motion)

• IoT Devices

• Social media feeds

• Clickstreams

• Financial tickers

These data points are often unordered, arrive continuously, and may be noisy or redundant.

2. Input Monitor / Stream Input Manager

Theory:

Responsible for:

• Accepting and validating incoming data.

• Adding timestamps (if not present).

• Applying preprocessing like filtering or deduplication.

• Ensuring synchronization when multiple streams are used.

This step is crucial for accurate query results.

3. Storage Layers
These storage types are optimized for speed and short-term retention, not long-term storage like in DBMS.

a. Working Storage

• Stores recent incoming tuples temporarily.

• Used in sliding or tumbling windows for continuous computation.

• Example: Last 1 minute of stock prices.

b. Summary Storage

• Keeps aggregated values or summarized data.

• Helps reduce processing time.

• Example: Average temperature per hour.

c. Static Storage

• Stores reference/static data (non-streaming).

• Used for joining with dynamic stream data.

• Example: List of sensor IDs and their locations.

4. Query Management System

This layer enables writing, storing, optimizing, and executing continuous queries.

a. Query Compiler

• Transforms user-defined queries into logical plans.

• Performs syntax checking, optimization, and translation to internal form.

• Continuous queries can be written in Stream SQL, CQL, or DSMS-specific languages.

b. Query Repository

• Stores registered continuous queries, metadata, and definitions.

• Enables reuse, modification, and rollback of queries.

c. Query Executor / Processor

• Continuously matches incoming data with query definitions.

• Supports window operations, joins, filters, aggregations.

• Very efficient and low-latency.

• Uses working, summary, and static storages.

5. Output Buffer / Query Result Queue

Theory:

• Results of continuous queries are not immediately delivered.


• Instead, they are buffered and possibly batched or prioritized.

• This stage prevents output bottlenecks during high loads.

6. Client Application / User Interface

Theory:

• Receives query output in real-time.

• May display results on a dashboard, trigger alerts, or send data to another system.

• Can also modify or add queries dynamically.

Working Cycle in DSMS

1. Data continuously flows in.

2. Input monitor tags and formats the data.

3. Temporary storages manage and organize data.

4. Queries are compiled and stored.

5. Query processor continuously executes the logic.

6. Results are buffered and sent to the client application.
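
To make this cycle concrete, here is a toy Python sketch (assumptions: the stream is a small list of (timestamp, temperature) tuples, working storage is a 3-reading sliding window, and the registered continuous query is a rolling average; a real DSMS runs these stages concurrently):

from collections import deque

working_storage = deque(maxlen=3)   # working storage: only the most recent tuples
output_buffer = []                  # output buffer / query result queue

def continuous_query(win):
    # The registered continuous query: average temperature over the current window.
    return sum(temp for _, temp in win) / len(win)

stream = [(1, 20.0), (2, 22.0), (3, 21.0), (4, 25.0), (5, 24.0)]
for timestamp, temperature in stream:                   # 1. data continuously flows in
    working_storage.append((timestamp, temperature))    # 2-3. input monitor + working storage
    result = continuous_query(working_storage)          # 4-5. query executed continuously
    output_buffer.append((timestamp, round(result, 2))) # 6. buffered for the client

print(output_buffer)  # [(1, 20.0), (2, 21.0), (3, 21.0), (4, 22.67), (5, 23.33)]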

Q3 Explain DSMS vs DBMS Architecture.


Q4 Discuss various Issues in Data Streaming

1. Unbounded and Infinite Data

What It Means:

Streaming systems process data that keeps coming forever — it doesn’t end like batch data. For example,
sensor data, logs, or stock market updates keep flowing 24/7.

Why It's a Problem:

• You can’t store everything because memory and disk are limited.

• You can’t wait for the stream to end before processing, because it never ends.

Solution:

• Use windowing techniques (like tumbling, sliding, or session windows) to split infinite data into finite
chunks.

• Helps in aggregation, filtering, and reporting in real-time without storing entire stream.
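
A minimal Python sketch of the tumbling-window idea (assumption: each event is a (timestamp_in_seconds, value) pair and we simply count events per fixed 10-second window):

from collections import defaultdict

def tumbling_window_counts(events, window_size=10):
    counts = defaultdict(int)
    for timestamp, _value in events:
        window_start = (timestamp // window_size) * window_size  # bucket the event
        counts[window_start] += 1
    return dict(counts)

events = [(1, "a"), (4, "b"), (12, "c"), (15, "d"), (27, "e")]
print(tumbling_window_counts(events))  # {0: 2, 10: 2, 20: 1}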

2. Low Latency Requirements (Real-Time Processing)

What It Means:

Many applications (like fraud detection, live dashboards, IoT alerting) need instant processing and response,
sometimes within milliseconds.

Why It's a Problem:


• Even a slight delay can make decisions useless or dangerous (e.g., the fraud has already happened, a vehicle crash was not prevented).

• Traditional DBMS systems have higher latency and are not built for this speed.

Solution:

• Use low-latency streaming engines like Apache Flink, Apache Storm, or Kafka Streams.

• These engines support parallelism, asynchronous processing, and backpressure control to maintain
speed.

3. Out-of-Order and Late-Arriving Data

What It Means:

In streaming, data might not arrive in the correct order due to network delays, retries, or buffering.

Why It's a Problem:

• Can cause incorrect analytics if the system processes late data after the window has closed.

• Time-based computations like "last 5 minutes average" will be wrong if older events arrive late.

Solution:

• Use event time (timestamp of when event occurred), not just processing time.

• Implement watermarks — a way to define how late data can still be accepted.

• Allow some buffer for late data to be included in correct window.
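
A minimal Python sketch of the event-time/watermark idea (assumption: the watermark trails the largest event time seen so far by 5 seconds; anything older than the watermark is treated as too late):

ALLOWED_LATENESS = 5

def process_stream(events):
    max_event_time = 0
    accepted, dropped = [], []
    for event_time, value in events:
        max_event_time = max(max_event_time, event_time)
        watermark = max_event_time - ALLOWED_LATENESS
        if event_time >= watermark:
            accepted.append((event_time, value))   # still inside the lateness bound
        else:
            dropped.append((event_time, value))    # arrived too late for its window
    return accepted, dropped

events = [(10, "a"), (12, "b"), (3, "too late"), (13, "c")]
print(process_stream(events))  # ([(10, 'a'), (12, 'b'), (13, 'c')], [(3, 'too late')])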

4. Memory and Storage Limitations

What It Means:

Streaming systems often hold data temporarily in memory for fast processing.

Why It's a Problem:

• Continuous data at high speed can fill memory quickly.

• System may crash or hang if memory overflows.

• You can’t store all unbounded data on disk either.

Solution:

• Use memory-efficient buffers, TTL (Time to Live) for old data, and spill to disk techniques.

• Only retain what is needed — like last few minutes or important fields.

• Use stream compression and compaction when needed.

5. Continuous Query Execution

What It Means:
Queries in DSMS run continuously (not just once like DBMS). They process live streams and give real-time
results continuously.

Why It's a Problem:

• Poorly written queries can consume excessive CPU, memory, and bandwidth.

• Some queries accumulate state over time, which slows down the system.

Solution:

• Use incremental computation (process only what's new).

• Optimize dataflows and use compiled query plans.

• Keep queries lightweight and stateless when possible, or periodically clear state.

6. Fault Tolerance and Recovery

What It Means:

In large distributed systems, things will go wrong — servers crash, power cuts, internet fails, etc.
But the system should not lose data or break.

Why It's a Problem:

• If there’s no backup, streamed data is lost forever — there’s no second chance.

• Even a single crash can lead to partial outputs or corrupted state.

Solution:

• Checkpointing: Save the system's state periodically. If it crashes, resume from last checkpoint.

• Replay Logs: Kafka and similar systems keep offsets and logs so failed data can be replayed.

• Replication: Duplicate critical tasks/nodes so that another can take over instantly during failure.

• Example: Apache Flink supports exactly-once semantics through these methods.
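
A toy Python sketch of checkpointing (assumptions: the operator state is a small dict of counts plus a stream offset, "checkpoint.json" is the checkpoint file, and a checkpoint is taken every 2 events; real engines such as Flink checkpoint to durable storage and coordinate across operators):

import json
import os

CHECKPOINT_FILE = "checkpoint.json"

def load_checkpoint():
    if os.path.exists(CHECKPOINT_FILE):
        with open(CHECKPOINT_FILE) as f:
            return json.load(f)          # resume from the last saved state
    return {"offset": 0, "counts": {}}   # fresh start

def save_checkpoint(state):
    with open(CHECKPOINT_FILE, "w") as f:
        json.dump(state, f)

state = load_checkpoint()
stream = ["a", "b", "a", "c", "b", "a"]
for offset in range(state["offset"], len(stream)):
    item = stream[offset]
    state["counts"][item] = state["counts"].get(item, 0) + 1
    state["offset"] = offset + 1
    if state["offset"] % 2 == 0:         # periodic checkpoint (arbitrary interval)
        save_checkpoint(state)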

7. High Volume and Velocity Handling

What It Means:

Streaming systems must handle millions of events per second — especially in IoT, ad tech, finance, etc.

Why It's a Problem:

• If the system can't keep up, it lags, drops messages, or crashes.

• Hardware might choke due to CPU or I/O bottlenecks.

Solution:

• Use horizontal scaling (add more nodes to share load).

• Use partitioning of streams (Kafka partitions, Flink parallelism).

• Use load balancing and backpressure control.


• Use cloud-native streaming services like AWS Kinesis, Azure Stream Analytics for elastic scaling.

8. Data Integration and Heterogeneity

What It Means:

Streams come from different sources – IoT sensors, logs, mobile apps, APIs – and in varied formats (JSON,
CSV, binary, etc).

Why It's a Problem:

• Integrating such data on-the-fly is hard.

• Schema mismatches or format errors may crash the stream pipeline.

Solution:

• Use data adapters or connectors (like Apache NiFi, Kafka Connect).

• Maintain a Schema Registry to enforce format consistency.

• Use real-time ETL tools (like Apache Beam) to transform and normalize incoming data.

9. Security and Privacy Concerns

What It Means:

Streaming data may include sensitive user data (credit card info, user behavior, sensor location).

Why It's a Problem:

• If stream gets intercepted, data breaches or compliance violations may happen (GDPR, HIPAA).

Solution:

• Use end-to-end encryption (TLS/SSL) for streams.

• Apply authentication and role-based access control (RBAC).

• Apply data masking or anonymization for personal info.

10. Accuracy vs Performance Trade-Off

What It Means:

Streaming systems must balance accuracy with performance — especially under high load.

Why It's a Problem:

• Exact processing of every event takes time → increases latency.

• Fast approximate results might be inaccurate.

Solution:

• Use approximate algorithms like Bloom Filters, HyperLogLog, Count-Min Sketch.


• For example, getting “approximate unique users” is often “good enough” and way faster.

Q5 Explain Bloom Filter Algorithm in detail with Example

A Bloom filter is a space-efficient probabilistic data structure that is used to test whether an item is a member of
a set. The Bloom filter will always say yes if an item is a member of the set. However, it might also say yes for an
item that is not a member of the set (a false positive). Items can be added to a Bloom filter, but they cannot be
removed. A Bloom filter supports the following operations:

• adding an item to the set

• testing the membership of an item in the set


1. Accept the input

The first step is to accept the input. In our example, let’s assume that the input is a string containing the text “John
Doe.”

2. Calculate the hash value

Next, the algorithm performs hashing to convert John Doe into a corresponding numerical value. For the sake of
our example, let’s assume that the value is 1355. The actual value is computed as per hashing algorithms, which
vary in complexity.

3. Mod the hash by the array length

The next step is to mod the hash value by the length of the bit array (mod finds the remainder of a division; in
programming it is written as %). In this example the bit array is assumed to have length 19, so applying mod to
John Doe's hash value 1355 gives an index within the bounds of the bit array:

1355 % 19 = 6

4. Insert the hash

We set the bit at the computed index. Therefore, the bit at index 6 of the array goes from 0 to 1.

5. Search for the value (i.e., lookup)

Steps 2 and 3 are performed again as part of the lookup process. This time, the algorithm checks the bit of the
array at the index given by the mod result. If the bit is 0, the input cannot possibly belong to the set. However, if
the bit is 1, the input may be an element of the set. In applications such as checking whether a username or
email ID is already taken, the new value is accepted only when the lookup returns 0 (i.e., the value is definitely
not in the set).

Let’s create a Bloom Filter with:

• Bit array size m = 10 (just for simplicity)

• Number of hash functions k = 3

Step 1: Bit Array Initialization

We start with a bit array of size 10, all bits set to 0:

Bit Array (initial):

[0 0 0 0 0 0 0 0 0 0]

Indexes:   0 1 2 3 4 5 6 7 8 9

Insert Element: "mango"

Let’s say our 3 hash functions give the following values:

• h1("mango") = 1

• h2("mango") = 4

• h3("mango") = 7

Set bits at positions 1, 4, and 7 to 1:

Bit Array (after inserting "mango"):

[0 1 0 0 1 0 0 1 0 0]

Indexes:   0 1 2 3 4 5 6 7 8 9

Insert Element: "banana"

Hash values:

• h1("banana") = 2

• h2("banana") = 4

• h3("banana") = 8

Set bits at positions 2, 4, and 8 to 1:


Bit Array (after inserting "banana"):

[0 1 1 0 1 0 0 1 1 0]

Indexes:   0 1 2 3 4 5 6 7 8 9

Insert Element: "apple"

Hash values:

• h1("apple") = 1

• h2("apple") = 3

• h3("apple") = 5

Set bits at positions 1, 3, and 5 to 1:

Bit Array (after inserting "apple"):

[0 1 1 1 1 1 0 1 1 0]

Indexes:   0 1 2 3 4 5 6 7 8 9

Now, Check Element: "mango"

Hash values (again):

• h1("mango") = 1

• h2("mango") = 4

• h3("mango") = 7

All bits are set to 1 → “mango” is probably in the set

Check Element: "grapes"

Hash values:

• h1("grapes") = 2

• h2("grapes") = 4

• h3("grapes") = 9 ← bit 9 is 0

Since bit 9 = 0, we say → “grapes” is definitely not in the set

Check Element: "orange"

Hash values:

• h1("orange") = 1

• h2("orange") = 4
• h3("orange") = 8

All bits = 1 → the Bloom filter says “orange is probably in the set”, but we never inserted it. So this is a false positive.
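
A minimal Python sketch of the same Bloom filter (assumptions: m = 10 and k = 3 as above, with the k hash functions built by salting SHA-256, so the exact bit positions will differ from the illustrative values used in the example):

import hashlib

class BloomFilter:
    def __init__(self, m=10, k=3):
        self.m = m
        self.k = k
        self.bits = [0] * m

    def _indexes(self, item):
        # Derive k different hash values by salting the item with the hash number.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.m

    def add(self, item):
        for idx in self._indexes(item):
            self.bits[idx] = 1

    def might_contain(self, item):
        # False means "definitely not in the set"; True means "probably in the set".
        return all(self.bits[idx] == 1 for idx in self._indexes(item))

bf = BloomFilter()
for fruit in ["mango", "banana", "apple"]:
    bf.add(fruit)

print(bf.might_contain("mango"))   # True: probably in the set (it was inserted)
print(bf.might_contain("grapes"))  # False means definitely absent; True here would be a false positive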

Q6 Explain FM Algorithm in detail with Example.

https://www.ques10.com/p/42324/explain-flajolet-martin-algorithm-with-example/

What is the FM Algorithm?

The Flajolet–Martin Algorithm is a probabilistic algorithm used to estimate the number of distinct elements
(cardinality) in a data stream using very little memory.

It doesn’t give the exact count — but gives an approximate value using hash functions and bit patterns.

Why Do We Need FM?

In big data or streaming environments, storing the full dataset in memory isn’t possible.
So, FM helps estimate:

• Number of unique IP addresses

• Number of distinct users

• Number of unique searches ...and more, with just a few bits of memory.

Core Idea

The key idea is:

The position of the rightmost 1-bit in the binary representation of a hash tells us something about the likelihood
of seeing that element.

The more unique elements we see, the greater the chance we’ll see hash values with more trailing zeros.

Steps of the FM Algorithm

1. Hash each element to a binary number (using a good hash function).

2. For each hashed value, count number of trailing 0s in the binary string.

3. Keep track of the maximum number of trailing 0s seen so far.

4. Estimate the number of distinct elements using:

Estimate = 2^R
where R = the maximum number of trailing 0s observed so far.

To improve accuracy, we run the algorithm multiple times with different hash functions and take the average.

Example: Let’s See It in Action

Stream:

[dog, cat, dog, cat, elephant, tiger, lion, cat]

Goal: Estimate number of distinct elements

Step 1: Apply Hash Function

Let’s hash each item to an 8-bit binary string using a hypothetical hash function; a runnable sketch of the full estimation process is shown below.
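
A minimal Python sketch of the Flajolet–Martin estimate for this stream (assumptions: SHA-256 truncated to 32 bits stands in for the hypothetical hash function, and the 2^R estimates from several salted hashes are averaged, as described above):

import hashlib

def trailing_zeros(n, width=32):
    if n == 0:
        return width                 # treat an all-zero hash as the maximum
    count = 0
    while n % 2 == 0:
        n //= 2
        count += 1
    return count

def fm_estimate(stream, num_hashes=10):
    estimates = []
    for i in range(num_hashes):
        r = 0                        # R = max trailing zeros seen for this hash function
        for item in stream:
            h = int(hashlib.sha256(f"{i}:{item}".encode()).hexdigest(), 16) & 0xFFFFFFFF
            r = max(r, trailing_zeros(h))
        estimates.append(2 ** r)
    return sum(estimates) / len(estimates)   # averaging several hashes reduces the variance

stream = ["dog", "cat", "dog", "cat", "elephant", "tiger", "lion", "cat"]
print(fm_estimate(stream))  # a rough estimate of the 5 distinct animals; FM is approximate by design
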
Q7 Explain DGIM Algorithm in detail with Example.

https://youtu.be/uFKWc2YR5MU?si=uuj1R2dgWD-mv8i6

https://medium.com/fnplus/dgim-algorithm-169af6bb3b0c#:~:text=In%20DGIM%20algorithm%2C%20each%20bit,as%20a%20multiple%20of%202)

The DGIM Algorithm is a streaming algorithm that efficiently estimates the number of 1s in the last k bits of a
data stream, using very little memory.
Why Do We Use DGIM?

In real-time systems like:

• Network traffic analysis

• Monitoring binary event logs

• IoT sensor data (e.g., binary sensor on/off)

...you often want to count how many 1s appeared in the last k bits of a huge stream (e.g., last 1000 or 1 million
events).

But storing the entire stream uses too much memory, so we use DGIM to estimate the count using a clever
compression trick.

How Does DGIM Work?

DGIM stores only a logarithmic number of buckets (a compressed representation of the 1s).

Each bucket represents a group of 1s. Each bucket has:

• Size (power of 2, like 1, 2, 4, 8…)

• Timestamp (when the last 1 in the bucket appeared)

Rules:

1. For any given bucket size, there are at most two buckets (each size appears once or twice).

2. When a third bucket of the same size appears → merge the two oldest of them into one bucket of double the size.

3. We discard old buckets beyond the last k bits.

Example of DGIM Algorithm

Stream:

We get a binary stream (bit by bit from right to left):

Stream (recent to old): 1 0 1 1 0 1 0 0 1 1 1 0 0 1

Index (position): 0 1 2 3 4 5 6 7 8 9 10 11 12 13

Suppose we want to count 1s in last k = 10 bits.

So, we care only about bits at positions 0 to 9:

Relevant window: [1 0 1 1 0 1 0 0 1 1]

Index: 0 1 2 3 4 5 6 7 8 9

Step-by-step Bucket Creation:

We go from right to left (from index 0 up to 9) and create buckets for the 1s; a runnable sketch of this bucket maintenance is shown below.
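
A minimal Python sketch of the bucket maintenance described above (assumptions: bits arrive one at a time, at most two buckets of each size are kept, and the estimate counts only half of the oldest bucket, as in the standard DGIM estimate):

class DGIM:
    def __init__(self, k):
        self.k = k            # window size: we care about the last k bits
        self.time = 0         # current timestamp (one tick per arriving bit)
        self.buckets = []     # (timestamp of the bucket's most recent 1, size), newest first

    def add_bit(self, bit):
        self.time += 1
        # Drop buckets that have fallen completely outside the window.
        self.buckets = [(t, s) for (t, s) in self.buckets if t > self.time - self.k]
        if bit == 1:
            self.buckets.insert(0, (self.time, 1))
            self._merge()

    def _merge(self):
        # If three buckets share a size, merge the two oldest into one of double size.
        i = 0
        while i + 2 < len(self.buckets):
            if self.buckets[i][1] == self.buckets[i + 1][1] == self.buckets[i + 2][1]:
                t_newer, size = self.buckets[i + 1]        # keep the newer timestamp
                self.buckets[i + 1] = (t_newer, 2 * size)
                del self.buckets[i + 2]
            else:
                i += 1

    def count_ones(self):
        # Sum all bucket sizes, but count only half of the oldest bucket.
        if not self.buckets:
            return 0
        total = sum(size for (_, size) in self.buckets)
        return total - self.buckets[-1][1] // 2

dgim = DGIM(k=10)
# The example stream above, fed oldest bit first (index 13 down to index 0).
for bit in [1, 0, 0, 1, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1]:
    dgim.add_bit(bit)
print(dgim.count_ones())  # 6; here the estimate matches the true count of 1s in the last 10 bits
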
Q8 Discuss Sampling Algorithms. Explain Reservoir Sampling in detail.
