PART-A
Q1: Illustrate the use of MapReduce in Hadoop to perform a word count on a specified
dataset. (5 Marks)
Answer:
MapReduce is a programming model used for parallel and distributed data processing on
large datasets in Hadoop. It consists of two main phases: Map and Reduce.
Steps to perform Word Count using MapReduce:
1. Input Splitting:
o The dataset (text file) is divided into splits (blocks), which are distributed across
different Hadoop DataNodes.
2. Mapping Phase:
o Each Mapper processes one split of data.
o For every word in the split, the mapper emits a key-value pair: <word, 1>
3. Shuffling and Sorting:
o All the values for the same key (word) are grouped together and sent to the
appropriate Reducer.
o Example: All <hadoop, 1> pairs go to the same reducer.
4. Reducing Phase:
o The Reducer sums up all the values for a given key.
o Example: <hadoop, 1>, <hadoop, 1>, <hadoop, 1> → <hadoop, 3>
5. Output:
o Final output contains the word counts stored in HDFS.
Example:
Input File:
Hadoop is fast
Hadoop is scalable
Mapper Output:
Hadoop,1
is,1
fast,1
Hadoop,1
is,1
scalable,1
Reducer Output:
Hadoop,2
is,2
fast,1
scalable,1
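Java Implementation (for reference):
A complete Java implementation of this job, modelled on the classic WordCount example that ships with Hadoop, is sketched below; the input and output HDFS paths are taken from the command line.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Mapper: emits <word, 1> for every word in its input split
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);   // e.g. <Hadoop, 1>
      }
    }
  }

  // Reducer: sums all counts received for the same word
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);  // e.g. <Hadoop, 2>
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class); // optional local pre-aggregation
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}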
Q2: Write a note on Parallel Data Processing and Distributed Data Processing. (5 Marks)
Answer:
1. Parallel Data Processing
Definition: Processing data simultaneously using multiple processors or cores in a
single machine.
Key Features:
o Data is divided into smaller tasks and processed in parallel.
o Uses shared memory.
Example: Running a Spark job in local mode on a single powerful machine with
multi-core CPUs (see the sketch after the comparison table below).
Advantages:
o High speed for small-to-medium datasets.
o Efficient CPU utilization.
Limitation: Scalability is capped by the resources of a single machine.
2. Distributed Data Processing
Definition: Processing data across multiple nodes or machines in a cluster.
Key Features:
o Data is divided and distributed across the cluster nodes.
o Each node processes its part of the data and results are combined.
Example: Hadoop MapReduce and Spark on a multi-node cluster.
Advantages:
o Can handle big data that does not fit in a single machine.
o Fault-tolerant due to replication (HDFS).
Limitation: Network communication between nodes adds overhead and coordination complexity.
Comparison Table:
Feature      | Parallel Processing    | Distributed Processing
System Type  | Single machine         | Multiple machines (cluster)
Memory       | Shared memory          | Distributed memory
Scalability  | Limited                | Highly scalable
Examples     | OpenMP, multithreading | Hadoop, Spark
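Illustration (parallel processing):
A minimal Java sketch of shared-memory parallelism: the parallel stream below spreads a word count across the CPU cores of one machine; the two input lines are hard-coded purely for illustration.

import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;

public class ParallelWordCount {
  public static void main(String[] args) {
    // All data sits in the memory of a single machine (shared memory)
    List<String> lines = Arrays.asList("Hadoop is fast", "Hadoop is scalable");

    // parallelStream() divides the work across the machine's CPU cores
    Map<String, Long> counts = lines.parallelStream()
        .flatMap(line -> Arrays.stream(line.split("\\s+")))
        .collect(Collectors.groupingByConcurrent(
            Function.identity(), Collectors.counting()));

    counts.forEach((word, count) -> System.out.println(word + "," + count));
  }
}

Distributed frameworks such as Spark express the same logic, but the elements live in partitions spread across cluster nodes rather than in one JVM's heap.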
PART-B
Q3: How is the processing of workloads performed in Big Data? What are the different
types of workloads? Illustrate with examples. (10 Marks)
Answer:
Big Data workloads are processed in a distributed environment like Hadoop or Spark using
the following steps:
1. Data Ingestion:
o Collecting data from various sources such as IoT devices, social media, logs.
2. Data Storage:
o Store raw data in HDFS, NoSQL databases or cloud storage.
3. Data Processing:
o Process data using batch, real-time, or interactive processing frameworks.
4. Workload Execution:
o Data is divided into tasks, distributed across nodes, processed in parallel, and
combined for results.
Types of Big Data Workloads:
1. Batch Processing Workload
o Description: Large volumes of data are collected and processed periodically.
o Tool: Hadoop MapReduce.
o Example: Monthly sales report generation.
2. Real-Time/Streaming Workload
o Description: Continuous processing of data as it arrives.
o Tool: Apache Kafka, Apache Flink, Spark Streaming.
o Example: Fraud detection in credit card transactions (see the streaming sketch below).
3. Interactive Workload
o Description: Querying data in near real-time.
o Tool: Hive, Impala, Presto.
o Example: Business analyst querying sales data for specific regions.
4. Machine Learning Workload
o Description: Training models on large datasets.
o Tool: Spark MLlib, TensorFlow on distributed clusters.
o Example: Recommendation engines (Amazon, Netflix).
Illustration:
Data Source → Ingestion → Storage (HDFS) → Batch/Real-time Processing →
Result/Visualization
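Illustration (streaming workload):
To make the real-time workload concrete, here is a minimal Java sketch using the Kafka consumer API; the broker address, consumer group id, and topic name ("transactions") are hypothetical placeholders, and a recent Kafka client (poll with Duration) is assumed.

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class StreamingWorkload {
  public static void main(String[] args) {
    Properties props = new Properties();
    props.put("bootstrap.servers", "localhost:9092"); // placeholder broker address
    props.put("group.id", "fraud-detector");          // hypothetical consumer group
    props.put("key.deserializer",
        "org.apache.kafka.common.serialization.StringDeserializer");
    props.put("value.deserializer",
        "org.apache.kafka.common.serialization.StringDeserializer");

    try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
      consumer.subscribe(Collections.singletonList("transactions")); // hypothetical topic

      // Unlike a batch job, a streaming workload runs indefinitely,
      // processing each record as soon as it arrives.
      while (true) {
        ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
        for (ConsumerRecord<String, String> record : records) {
          // e.g. apply a fraud-detection rule or model to each transaction
          System.out.println("Processing transaction: " + record.value());
        }
      }
    }
  }
}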
Q4: Compare Transactional Processing and Batch Processing with the help of neat
diagrams. (10 Marks)
Answer:
1. Transactional Processing (OLTP - Online Transaction Processing):
Processes individual transactions quickly.
Focused on data consistency and real-time updates.
Example: Banking system (money transfer; see the JDBC sketch below).
Characteristics:
Small data per transaction.
Real-time response.
Uses databases like MySQL, PostgreSQL.
2. Batch Processing (OLAP - Online Analytical Processing):
Processes large volumes of data in batches.
Focused on analysis and historical data processing.
Example: Daily sales report generation.
Characteristics:
Large data blocks.
High throughput, but delayed response.
Uses Hadoop, Spark, Hive.
Comparison Table:
Feature         | Transactional (OLTP)  | Batch (OLAP)
Processing type | Real-time             | Scheduled/delayed
Data volume     | Small per transaction | Large dataset
Examples        | ATM withdrawal        | Payroll generation
Technology      | RDBMS                 | Hadoop, Spark
Neat Diagram:
Transactional Processing:
[User Request] → [Immediate Transaction] → [Database Update]
Batch Processing:
[Data Collected Over Time] → [Process in Batch] → [Result Output]
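Illustration (transactional processing):
A minimal JDBC sketch of the money-transfer example above, showing why OLTP emphasizes consistency: either both updates commit or neither does. The accounts table and its balance and id columns are hypothetical, and the JDBC URL is supplied by the caller.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.SQLException;

public class TransferExample {
  public static void transfer(String url, int fromId, int toId, double amount)
      throws SQLException {
    try (Connection conn = DriverManager.getConnection(url)) {
      conn.setAutoCommit(false); // group both updates into one transaction
      try (PreparedStatement debit = conn.prepareStatement(
               "UPDATE accounts SET balance = balance - ? WHERE id = ?");
           PreparedStatement credit = conn.prepareStatement(
               "UPDATE accounts SET balance = balance + ? WHERE id = ?")) {
        debit.setDouble(1, amount);
        debit.setInt(2, fromId);
        debit.executeUpdate();

        credit.setDouble(1, amount);
        credit.setInt(2, toId);
        credit.executeUpdate();

        conn.commit();   // both updates become visible together (consistency)
      } catch (SQLException e) {
        conn.rollback(); // on failure, neither update is applied
        throw e;
      }
    }
  }
}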
PART-C
Q5: Compare the Big Data Storage Concepts and explain at least two of them in detail. (20
Marks)
Answer:
Big Data requires special storage systems to handle volume, velocity, and variety of data. The
main storage concepts are:
1. HDFS (Hadoop Distributed File System)
o Concept: Stores large files across clusters of commodity machines.
o Features:
Block-based storage (128MB or 256MB blocks).
Replication for fault tolerance (default 3 copies).
o Advantage: High throughput and fault tolerance.
o Example: Storing logs and clickstream data (see the HDFS API sketch after the
comparison table below).
2. NoSQL Databases
o Concept: Non-relational databases optimized for unstructured/semi-structured
data.
o Types:
Key-Value Stores: Redis, DynamoDB
Document Stores: MongoDB
Columnar Stores: Cassandra, HBase
o Advantage: Scalability and flexible schema.
3. Cloud Storage
o Concept: Store and process data in cloud platforms like AWS S3, Google Cloud
Storage.
o Advantage: Elastic storage and managed infrastructure.
4. Data Lakes
o Concept: Central repository for raw structured and unstructured data.
o Advantage: Supports ML, analytics, and real-time processing.
Comparison Table:
Storage Concept | Structure                       | Example                | Use Case
HDFS            | Block storage                   | Hadoop                 | Batch processing
NoSQL Database  | Key-value/document/columnar     | MongoDB, HBase         | Real-time applications
Cloud Storage   | Object storage                  | AWS S3                 | Scalable cloud storage
Data Lake       | Raw structured and unstructured | Built on HDFS/cloud storage | ML, analytics, real-time processing
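Illustration (HDFS API):
A small sketch of the HDFS concept using Hadoop's FileSystem API; the NameNode URI and file path below are placeholders.

import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Placeholder NameNode address; HDFS splits the file into blocks
    // and replicates each block (3 copies by default) across DataNodes.
    conf.set("fs.defaultFS", "hdfs://namenode:8020");

    try (FileSystem fs = FileSystem.get(conf);
         FSDataOutputStream out = fs.create(new Path("/data/logs/sample.txt"))) {
      out.write("Hadoop is fast\nHadoop is scalable\n".getBytes(StandardCharsets.UTF_8));
    }
  }
}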
Q6: With the help of a neat Venn Diagram, compare the Speed, Consistency, and Volume
in Big Data Analytics. Also explain which combinations are possible and which are not, and
why. (20 Marks)
Answer:
In Big Data, the three important properties are often considered as part of the CAP/Big Data
trade-offs:
1. Speed (Velocity):
o Ability to process and analyze data quickly.
o Example: Real-time fraud detection.
2. Consistency:
o Accuracy and reliability of data processing.
o Example: Banking transactions must remain consistent.
3. Volume:
o Handling very large datasets.
o Example: Social media analytics on petabytes of data.
Venn Diagram (three overlapping circles: Speed, Consistency, Volume):

           [Speed]
          /       \
         /         \
[Consistency]-----[Volume]

Each pairwise overlap is achievable; the central region where all three meet is not.
Possible Combinations:
1. Speed + Consistency (Without Volume)
o Real-time OLTP systems.
o Limited data, accurate and fast.
2. Speed + Volume (Without Consistency)
o Real-time analytics where eventual consistency is acceptable.
o Example: Social media trends.
3. Volume + Consistency (Without Speed)
o Batch processing of massive data with accurate results.
o Example: Monthly payroll or census data analysis.
Not Possible:
All three (Speed + Consistency + Volume) simultaneously are extremely difficult to
achieve: keeping a very large, distributed dataset strongly consistent requires
coordination between nodes, and that coordination inevitably slows processing. This
mirrors the trade-offs of the CAP theorem and is compounded by resource constraints.