
PART-A

Q1: Illustrate the use of MapReduce in Hadoop to perform a word count on a given dataset. (5 Marks)

Answer:

MapReduce is a programming model used for parallel and distributed data processing on
large datasets in Hadoop. It consists of two main phases: Map and Reduce.

Steps to perform Word Count using MapReduce:

1. Input Splitting:
   o The dataset (text file) is divided into splits (blocks), which are distributed across different Hadoop DataNodes.
2. Mapping Phase:
   o Each Mapper processes one split of data.
   o For every word in the split, the mapper emits a key-value pair: <word, 1>.
3. Shuffling and Sorting:
   o All the values for the same key (word) are grouped together and sent to the appropriate Reducer.
   o Example: all <hadoop, 1> pairs go to the same reducer.
4. Reducing Phase:
   o The Reducer sums up all the values for a given key.
   o Example: <hadoop, 1> <hadoop, 1> <hadoop, 1> → <hadoop, 3>
5. Output:
   o The final output, containing the word counts, is stored in HDFS.

Example:
Input File:

Hadoop is fast
Hadoop is scalable

• Mapper Output:

Hadoop,1
is,1
fast,1
Hadoop,1
is,1
scalable,1

• Reducer Output:

Hadoop,2
is,2
fast,1
scalable,1
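
A minimal sketch of the Mapper and Reducer classes for this word count, written against the standard Hadoop Java API (in a real project each class would live in its own file):

// WordCountMapper.java
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Emit <word, 1> for every word in this mapper's input split.
        StringTokenizer tokens = new StringTokenizer(value.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE);   // e.g. <Hadoop, 1>
        }
    }
}

// WordCountReducer.java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        // Sum the 1s grouped under each word after the shuffle/sort phase.
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        context.write(key, new IntWritable(sum));  // e.g. <Hadoop, 2>
    }
}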

Q2: Write a note on Parallel Data Processing and Distributed Data Processing. (5 Marks)

Answer:

1. Parallel Data Processing

• Definition: Processing data simultaneously using multiple processors or cores in a single machine.
• Key Features:
  o Data is divided into smaller tasks and processed in parallel.
  o Uses shared memory.
• Example: Running a Spark job on a single powerful machine with multi-core CPUs.
• Advantages:
  o High speed for small-to-medium data.
  o Efficient CPU utilization.
• Limitation: Scalability is limited to a single machine.
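
As a small illustration of single-machine parallelism, here is a sketch of the word count using Java's built-in parallel streams (the input lines are the example from Q1):

import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;

public class ParallelWordCount {
    public static void main(String[] args) {
        List<String> lines = List.of("Hadoop is fast", "Hadoop is scalable");

        // parallelStream() fans the work out across the cores of one machine;
        // all worker threads share the same heap (shared memory).
        Map<String, Long> counts = lines.parallelStream()
                .flatMap(line -> Arrays.stream(line.split("\\s+")))
                .collect(Collectors.groupingByConcurrent(
                        Function.identity(), Collectors.counting()));

        System.out.println(counts); // e.g. {Hadoop=2, fast=1, is=2, scalable=1}
    }
}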

2. Distributed Data Processing

• Definition: Processing data across multiple nodes or machines in a cluster.
• Key Features:
  o Data is divided and distributed across the cluster nodes.
  o Each node processes its part of the data, and the results are combined.
• Example: Hadoop MapReduce and Spark on a multi-node cluster.
• Advantages:
  o Can handle big data that does not fit on a single machine.
  o Fault-tolerant due to replication (HDFS).
• Limitation: Network overhead may occur.
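
The same computation expressed for a multi-node cluster, sketched with Spark's Java API (the HDFS paths are illustrative assumptions):

import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class DistributedWordCount {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("DistributedWordCount");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            // Each partition of the file is processed on a different cluster node.
            JavaRDD<String> lines = sc.textFile("hdfs:///data/input.txt");
            JavaPairRDD<String, Integer> counts = lines
                    .flatMap(line -> Arrays.asList(line.split("\\s+")).iterator())
                    .mapToPair(word -> new Tuple2<>(word, 1))
                    .reduceByKey(Integer::sum); // partial sums combined across nodes
            counts.saveAsTextFile("hdfs:///data/output");
        }
    }
}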

Comparison Table:

Feature     | Parallel Processing    | Distributed Processing
------------|------------------------|-----------------------------
System Type | Single machine         | Multiple machines (cluster)
Memory      | Shared memory          | Distributed memory
Scalability | Limited                | Highly scalable
Examples    | OpenMP, multithreading | Hadoop, Spark

PART-B

Q3: How is the processing of workloads performed in Big Data? What are the different types of workloads? Illustrate with examples. (10 Marks)

Answer:

Big Data workloads are processed in a distributed environment like Hadoop or Spark using
the following steps:

1. Data Ingestion:
   o Collecting data from various sources such as IoT devices, social media, and logs.
2. Data Storage:
   o Storing raw data in HDFS, NoSQL databases, or cloud storage.
3. Data Processing:
   o Processing data using batch, real-time, or interactive processing frameworks.
4. Workload Execution:
   o Data is divided into tasks, distributed across nodes, processed in parallel, and combined for results.

Types of Big Data Workloads:

1. Batch Processing Workload
   o Description: Large volumes of data are collected and processed periodically.
   o Tool: Hadoop MapReduce.
   o Example: Monthly sales report generation.
2. Real-Time/Streaming Workload
   o Description: Continuous processing of data as it arrives.
   o Tools: Apache Kafka, Apache Flink, Spark Streaming.
   o Example: Fraud detection in credit card transactions (sketched after this list).
3. Interactive Workload
   o Description: Querying data in near real-time.
   o Tools: Hive, Impala, Presto.
   o Example: A business analyst querying sales data for specific regions.
4. Machine Learning Workload
   o Description: Training models on large datasets.
   o Tools: Spark MLlib, TensorFlow on distributed clusters.
   o Example: Recommendation engines (Amazon, Netflix).
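
For the streaming workload, a minimal consumer sketch using the Kafka Java client; the topic name "transactions" and the isSuspicious rule are illustrative assumptions:

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class FraudDetectionConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "fraud-detector");
        props.put("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("transactions")); // assumed topic
            while (true) {
                // Records are processed continuously as they arrive.
                ConsumerRecords<String, String> records =
                        consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    if (isSuspicious(record.value())) {
                        System.out.println("ALERT: possible fraud -> " + record.value());
                    }
                }
            }
        }
    }

    // Hypothetical rule: flag transactions pre-marked as over a limit.
    private static boolean isSuspicious(String transactionJson) {
        return transactionJson.contains("\"amount_over_limit\":true");
    }
}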

Illustration:

Data Source → Ingestion → Storage (HDFS) → Batch/Real-time Processing → Result/Visualization

Q4: Compare Transactional Processing and Batch Processing with the help of neat
diagrams. (10 Marks)
Answer:

1. Transactional Processing (OLTP - Online Transaction Processing):

• Processes individual transactions quickly.
• Focused on data consistency and real-time updates.
• Example: Banking system (money transfer).

Characteristics:

• Small data per transaction.
• Real-time response.
• Uses databases like MySQL, PostgreSQL.
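
A minimal sketch of a transactional money transfer using plain JDBC (the accounts table and column names are illustrative); both updates either commit together or roll back together:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.SQLException;

public class MoneyTransfer {
    public static void transfer(String url, long fromAcct, long toAcct, double amount)
            throws SQLException {
        try (Connection conn = DriverManager.getConnection(url)) {
            conn.setAutoCommit(false); // group both updates into one transaction
            try (PreparedStatement debit = conn.prepareStatement(
                         "UPDATE accounts SET balance = balance - ? WHERE id = ?");
                 PreparedStatement credit = conn.prepareStatement(
                         "UPDATE accounts SET balance = balance + ? WHERE id = ?")) {
                debit.setDouble(1, amount);
                debit.setLong(2, fromAcct);
                debit.executeUpdate();

                credit.setDouble(1, amount);
                credit.setLong(2, toAcct);
                credit.executeUpdate();

                conn.commit();   // both updates become visible atomically
            } catch (SQLException e) {
                conn.rollback(); // consistency preserved on failure
                throw e;
            }
        }
    }
}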

2. Batch Processing (OLAP - Online Analytical Processing):

• Processes large volumes of data in batches.
• Focused on analysis and historical data processing.
• Example: Daily sales report generation.

Characteristics:

• Large data blocks.
• High throughput, but delayed response.
• Uses Hadoop, Spark, Hive.

Comparison Table:

Feature         | Transactional (OLTP)  | Batch (OLAP)
----------------|-----------------------|--------------------
Processing type | Real-time             | Scheduled/Delayed
Data volume     | Small per transaction | Large dataset
Examples        | ATM withdrawal        | Payroll generation
Technology      | RDBMS                 | Hadoop, Spark

Neat Diagram:

Transactional Processing:
[User Request] → [Immediate Transaction] → [Database Update]

Batch Processing:
[Data Collected Over Time] → [Process in Batch] → [Result Output]

PART-C
Q5: Compare the Big Data Storage Concepts and explain at least two of them in detail. (20
Marks)

Answer:

Big Data requires special storage systems to handle the volume, velocity, and variety of data. The main storage concepts are:

1. HDFS (Hadoop Distributed File System)
   o Concept: Stores large files across clusters of commodity machines.
   o Features:
     ▪ Block-based storage (128 MB or 256 MB blocks).
     ▪ Replication for fault tolerance (default 3 copies).
   o Advantage: High throughput and fault tolerance.
   o Example: Storing logs, clickstream data.
2. NoSQL Databases
   o Concept: Non-relational databases optimized for unstructured/semi-structured data.
   o Types:
     ▪ Key-Value Stores: Redis, DynamoDB
     ▪ Document Stores: MongoDB
     ▪ Columnar Stores: Cassandra, HBase
   o Advantage: Scalability and flexible schema.
3. Cloud Storage
   o Concept: Store and process data on cloud platforms like AWS S3, Google Cloud Storage.
   o Advantage: Elastic storage and managed infrastructure.
4. Data Lakes
   o Concept: Central repository for raw structured and unstructured data.
   o Advantage: Supports ML, analytics, and real-time processing.
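
A small sketch of writing a log file into HDFS through the Hadoop FileSystem Java API (the path and sample record are illustrative assumptions):

import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsLogWriter {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration(); // reads core-site.xml / hdfs-site.xml
        try (FileSystem fs = FileSystem.get(conf)) {
            Path file = new Path("/logs/clickstream.txt"); // assumed HDFS path
            // HDFS splits the file into blocks and replicates each block
            // (3 copies by default) across DataNodes for fault tolerance.
            try (FSDataOutputStream out = fs.create(file)) {
                out.write("user42,click,homepage\n".getBytes(StandardCharsets.UTF_8));
            }
        }
    }
}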

Comparison Table:

Storage Concept | Structure      | Example        | Use Case
----------------|----------------|----------------|-----------------------
HDFS            | Block storage  | Hadoop         | Batch processing
NoSQL Database  | Key-Value/Doc  | MongoDB, HBase | Real-time applications
Cloud Storage   | Object storage | AWS S3         | Scalable cloud storage

Q6: With the help of a neat Venn Diagram, compare the Speed, Consistency, and Volume
in Big Data Analytics. Also explain which combinations are possible and which are not, and
why. (20 Marks)

Answer:
In Big Data, three important properties are often considered together as part of the CAP/Big Data trade-offs:

1. Speed (Velocity):
o Ability to process and analyze data quickly.
o Example: Real-time fraud detection.
2. Consistency:
o Accuracy and reliability of data processing.
o Example: Banking transactions must remain consistent.
3. Volume:
o Handling very large datasets.
o Example: Social media analytics on petabytes of data.

Venn Diagram:

              [Speed]
             /       \
            /         \
           /           \
 [Consistency] ----- [Volume]

Possible Combinations:

1. Speed + Consistency (without Volume)
   o Real-time OLTP systems.
   o Limited data, accurate and fast.
2. Speed + Volume (without Consistency)
   o Real-time analytics where eventual consistency is acceptable.
   o Example: Social media trends.
3. Volume + Consistency (without Speed)
   o Batch processing of massive data with accurate results.
   o Example: Monthly payroll or census data analysis.

Not Possible:

• Achieving all three (Speed + Consistency + Volume) simultaneously is extremely difficult due to CAP theorem limitations and resource constraints.
