BDA PartB
Unit-1
Question 1: (Remembering) What is big data, and how does it relate to the convergence of
key trends in technology and business?
Answer: Big data refers to the vast volume of structured and unstructured data that organizations
generate and collect. It encompasses three key characteristics: volume, velocity, and variety. The
convergence of big data with key trends in technology and business has led to transformative
changes in various industries.
Internet of Things (IoT): Connecting devices and collecting real-time data, generating massive data streams.
The convergence of these trends has empowered organizations to extract valuable insights from
big data, enhance operational efficiency, and improve customer experiences, making it a pivotal
aspect of modern business strategies.
Question 2: (Understanding) Explain the concept of unstructured data and its significance
in the context of big data.
Answer: Unstructured data refers to data that lacks a predefined data model or schema. It
includes textual content, images, videos, social media posts, and more. In the context of big data,
unstructured data is significant because it constitutes a substantial portion of the data generated
daily.
Question 3: (Applying) Provide industry examples of big data applications and their
impact on business outcomes.
Answer: Big data applications have revolutionized industries, driving data-driven decision-
making and optimizing various business processes. Let's explore some industry examples and
their impact.
Marketing: Web analytics and customer behavior analysis enable targeted marketing and improved customer engagement.
Big data applications empower organizations to analyze vast datasets, gain actionable insights,
and drive business growth by enhancing customer satisfaction and operational efficiency.
Question 4: (Analyzing) Evaluate the role of Hadoop in handling big data and its
advantages for businesses.
Answer: Hadoop plays a pivotal role in handling big data, providing a scalable and cost-effective
solution for data storage and processing.
Advantages of Hadoop include scalability across clusters of commodity hardware, cost-effectiveness, fault tolerance through data replication, and the flexibility to store and process both structured and unstructured data.
By leveraging Hadoop, businesses can handle massive datasets effectively, gain valuable
insights, and accelerate data-driven decision-making.
Question 5: (Evaluating) Assess the significance of open-source technologies in the big data analytics landscape.
Answer: Open-source technologies have had a profound impact on the big data analytics landscape, offering organizations benefits such as zero licensing costs, active community support, transparency, and the flexibility to customize and extend tools.
The significance of open-source technologies in big data analytics lies in their ability to
democratize access to powerful tools, foster innovation, and empower organizations to harness
the full potential of big data analytics.
Question 6: (Creating) Design a crowd-sourcing approach for collecting and analyzing biodiversity data.
Answer: A crowd-sourced biodiversity monitoring initiative can be organized around the following components:
Data Collection and Verification: Establish a mobile app or website for users to submit photos, GPS coordinates, and observation data.
Data Validation and Quality Assurance: Implement a verification process to validate submitted data for accuracy and reliability.
Data Visualization and Analysis: Use crowd-sourced data to create interactive maps and reports to monitor biodiversity trends.
Impact and Community Engagement: Share insights and findings with participants, fostering a sense of ownership and community engagement.
Question 7: (Creating) Develop a plan for implementing inter- and trans-firewall analytics for a company's distributed systems.
Answer: Implementing inter- and trans-firewall analytics is crucial for enhancing data security in distributed systems. A plan for the company would cover the following steps: centralized collection of logs from all firewalls, real-time stream processing of network events, correlation and anomaly detection across firewall boundaries, and alerting with defined incident-response procedures.
By following this plan, the company can strengthen its data security, mitigate risks, and protect
critical data assets from unauthorized access and cyber threats.
Question 8: (Evaluating) Assess the impact of web analytics in big data applications and its
significance for digital marketing.
Answer: Web analytics plays a vital role in big data applications, enabling organizations to gain
insights from website data and optimize digital marketing strategies.
Impact of Web Analytics:
Customer Behavior Analysis: Web analytics tracks user behavior, interactions, and preferences, providing insights into customer journeys and experiences.
Web analytics has become a cornerstone of digital marketing, helping businesses understand
customer behavior, improve user experiences, and optimize marketing efforts for higher
engagement and conversion rates. Its impact on big data applications allows organizations to
adapt and thrive in the dynamic digital landscape.
9. Explain how big data is applied in fraud detection and health care analysis. (MAY/JUNE
2025)
Key Points:
Fraud Detection:
o Real-time data processing from various sources (transactions, logs, etc.)
o Pattern recognition and anomaly detection using ML models (a simple scoring sketch follows this list)
o Predictive modeling to flag suspicious activities
o Case: Credit card fraud detection
Health Care Analysis:
o Patient data analytics for diagnosis and treatment personalization
o Predictive analytics for epidemic outbreaks
o Integration of EHR (Electronic Health Records)
o Case: Wearable health tech for remote monitoring
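To make the anomaly-detection idea in the fraud-detection bullets concrete, here is a minimal, self-contained Java sketch. It uses a simple z-score rule rather than a trained ML model, and the sample amounts and the threshold of 3.0 are illustrative assumptions, not values from the syllabus.
import java.util.Arrays;

public class FraudAnomalyCheck {
    // Flag a new transaction amount if it lies more than 'threshold'
    // standard deviations away from the customer's historical mean.
    static boolean isSuspicious(double[] history, double newAmount, double threshold) {
        double mean = Arrays.stream(history).average().orElse(0.0);
        double variance = Arrays.stream(history)
                                .map(a -> (a - mean) * (a - mean))
                                .average().orElse(0.0);
        double stdDev = Math.sqrt(variance);
        if (stdDev == 0.0) {
            return newAmount != mean;          // no variation in the history at all
        }
        double zScore = Math.abs(newAmount - mean) / stdDev;
        return zScore > threshold;
    }

    public static void main(String[] args) {
        double[] pastAmounts = {25.0, 30.0, 27.5, 22.0, 31.0};      // illustrative history
        System.out.println(isSuspicious(pastAmounts, 29.0, 3.0));   // false: normal spend
        System.out.println(isSuspicious(pastAmounts, 950.0, 3.0));  // true: large deviation
    }
}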
10.i) Discuss about few open-source technologies in Big Data. (NOV/DEC 2024)
Key Points:
Apache Hadoop: Distributed storage and processing (HDFS + MapReduce)
Apache Spark: Fast in-memory data processing engine
Apache Hive: SQL-like interface on Hadoop
Apache Flink / Storm: Real-time stream processing
Apache Kafka: Distributed messaging system
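As a minimal illustration of the Kafka entry above, the following Java sketch publishes one message with the Kafka producer client; the broker address localhost:9092 and the topic name sensor-events are assumptions for illustration.
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class SimpleProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");   // assumed broker address
        props.put("key.serializer",
                  "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                  "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Send one key/value record to the assumed 'sensor-events' topic
            producer.send(new ProducerRecord<>("sensor-events", "sensor-42", "temperature=21.5"));
        }
    }
}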
10.ii) How are cloud computing and big data related to each other? Explain. (NOV/DEC 2024)
Key Points:
Cloud provides scalable infrastructure for Big Data storage and processing
Cloud-based tools (AWS EMR, Google BigQuery, Azure HDInsight)
Cost efficiency: pay-as-you-use model
Integration and flexibility for real-time processing
Big Data workloads benefit from cloud elasticity and distributed architecture
11. Explain Big Data Analytics and its characteristics. (NOV/DEC 2024)
Key Points:
Definition of Big Data Analytics: Process of examining large datasets to uncover hidden
patterns
Types: Descriptive, Diagnostic, Predictive, Prescriptive
Characteristics (5 Vs):
o Volume
o Velocity
o Variety
o Veracity
o Value
12. Explain Hadoop Architecture and its components with proper diagram. (NOV/DEC
2024)
Key Points:
Hadoop Core Components:
o HDFS (NameNode, DataNode)
o MapReduce (JobTracker, TaskTracker)
YARN Architecture (ResourceManager, NodeManager)
Workflow: Data ingestion → storage → processing
Diagram: Clearly show interaction between components
13. Explain how big data is applied for Inter and Trans-firewall analysis. (MAY/JUNE
2025 – 13 Marks Key)
Key Answer Points (Structure it for 13 Marks):
Introduction (2 Marks):
Define firewall analytics and differentiate inter-firewall (between firewalls) and trans-
firewall (across firewalls and networks).
Role of big data in cybersecurity and intrusion detection.
Big Data in Firewall Analytics (4 Marks):
Collection of massive logs from multiple firewalls.
Use of real-time stream processing (Apache Kafka, Flink).
Pattern detection and correlation of network events.
Identification of anomalies, lateral movements, port scans, etc.
Techniques and Tools (3 Marks):
Use of:
o Machine learning for predictive attack detection.
o SIEM tools enhanced with big data platforms.
o Visualization tools for identifying attack trends (e.g., Kibana).
Data mining and event correlation models.
Applications and Benefits (2 Marks):
Improved incident response time.
Enhanced security posture with proactive defense.
Better compliance reporting and audit trails.
Conclusion (2 Marks):
Big data enables deep, scalable, and real-time analysis of firewall traffic.
Helps in identifying sophisticated threats across network boundaries.
Unit-2
Question 1: (Remembering) What is NoSQL, and how does it differ from traditional
relational databases?
Answer: NoSQL, short for "Not Only SQL," is a database management system designed to
handle large volumes of unstructured and semi-structured data efficiently. Unlike traditional
relational databases, NoSQL databases do not rely on a fixed schema and offer greater flexibility
in data modeling.
NoSQL databases offer advantages in handling unstructured and rapidly evolving data, making
them suitable for modern big data applications and use cases where flexibility and scalability are
crucial.
Question 2: (Understanding) Compare the key-value and document data models in NoSQL
databases.
Answer: Key-value and document data models are two popular data models used in NoSQL
databases, each offering unique benefits for different use cases.
Key-value data models excel in high-performance scenarios, like caching and session
management, due to their simplicity and efficient data retrieval. On the other hand, the document
data model's flexibility makes it well-suited for complex data structures and use cases where data
evolves frequently, like content management systems and e-commerce platforms.
Question 3: (Applying) Explain the concept of graph databases and their applications in
real-world scenarios.
Answer: Graph databases are NoSQL databases that use graph structures to represent and store
data, making them ideal for scenarios where relationships between data points are crucial.
Graph databases excel in scenarios where the analysis of relationships between data points is
vital. Their ability to traverse complex networks efficiently makes them powerful tools for
various real-world applications.
Question 4: (Analyzing) Evaluate the concept of materialized views and their role in
improving database performance.
Answer: Materialized views are precomputed views of data stored physically in the database,
providing improved query performance by avoiding expensive computations during runtime.
Advantages of Materialized Views:
Complex Aggregations and Joins: Materialized views simplify complex aggregations and joins, reducing the complexity of queries.
Materialized views are particularly beneficial for large and complex databases, where frequent
query optimization is essential to ensure efficient data retrieval and processing.
Question 5: (Evaluating) Assess the distribution models used in NoSQL databases and their
impact on data availability and fault tolerance.
Answer: Distribution models in NoSQL databases dictate how data is distributed and replicated
across nodes in a distributed system, directly affecting data availability and fault tolerance.
Choosing the appropriate distribution model depends on the specific use case, data volume, and
performance requirements. Properly implemented distribution models play a critical role in
ensuring data availability and fault tolerance in NoSQL databases.
Question 6: (Creating) Design a master-slave replication setup for a NoSQL database and describe its components.
Answer: A master-slave replication setup in a NoSQL database involves one primary node (master) and one or more secondary nodes (slaves) that replicate data from the master.
Master Node: Handles write operations and serves as the primary source of data.
Slave Nodes: Replicate data from the master node to ensure data redundancy.
Failover Mechanism: Automatic failover to a slave node in case the master node fails.
This master-slave replication setup ensures data redundancy, improved read performance, and
fault tolerance by enabling automatic failover to maintain data availability even if the master
node goes offline.
Question 7: (Creating) Develop a data consistency strategy for a distributed NoSQL database such as Cassandra.
Answer: Maintaining data consistency in a distributed NoSQL database like Cassandra is crucial for data integrity. A comprehensive data consistency strategy includes the following components:
Table: Data Consistency Strategy for Cassandra
Tunable Consistency Levels: Choose appropriate read and write consistency levels (for example, QUORUM) to balance consistency, availability, and latency.
Anti-Entropy and Compaction: Regularly run anti-entropy repair and compaction to reconcile data across replicas.
By following this data consistency strategy, the NoSQL database can maintain data integrity and
deliver reliable query results even in a distributed environment.
Question 8: (Evaluating) Assess the role of Cassandra clients in interacting with a
Cassandra database and their advantages.
Answer: Cassandra clients are software libraries that enable applications to interact with the
Cassandra database, executing read and write operations.
Advantages of Cassandra Clients:
Load Balancing and Failover Management: Clients handle load balancing and failover to ensure optimal performance and high availability.
Cassandra clients serve as crucial middleware between applications and the database, offering
various advantages that enhance the development and performance of applications interacting
with Cassandra.
11. Explain the concept of graph databases and their applications in real-world scenarios.
(NOV/DEC 2024)
Key Points:
Concept:
o A graph database uses nodes, edges, and properties to represent and store data.
o Nodes = entities, Edges = relationships
o Supports complex, many-to-many relationships efficiently
o Query language: Cypher (Neo4j)
Graph Model Advantage:
o Highly interconnected data
o Fast relationship traversal (better than joins)
Example (Neo4j Cypher; a Java driver sketch follows this list):
MATCH (p:Person)-[:FRIENDS_WITH]->(f:Person) RETURN p.name, f.name;
MATCH (p:Person)-[:LIKES]->(prod:Product) RETURN p.name, prod.name;
Applications:
o Social Networks (Facebook, LinkedIn)
o Recommendation Engines (e.g., Netflix, Amazon)
o Fraud Detection (analyzing transaction networks)
o Knowledge Graphs (Google Knowledge Panel)
o Supply Chain Management
Benefits:
o Intuitive representation of relationships
o Real-time pathfinding and network analysis
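A minimal sketch of querying such a graph from Java, assuming the official Neo4j Java driver (4.x) is on the classpath; the bolt URI, the credentials, and the Person/FRIENDS_WITH data are illustrative assumptions.
import org.neo4j.driver.AuthTokens;
import org.neo4j.driver.Driver;
import org.neo4j.driver.GraphDatabase;
import org.neo4j.driver.Record;
import org.neo4j.driver.Result;
import org.neo4j.driver.Session;

public class FriendQuery {
    public static void main(String[] args) {
        // Connection details below are assumptions for illustration
        try (Driver driver = GraphDatabase.driver("bolt://localhost:7687",
                                                  AuthTokens.basic("neo4j", "password"));
             Session session = driver.session()) {
            // Traverse FRIENDS_WITH relationships starting from one person
            Result result = session.run(
                "MATCH (p:Person {name: 'Alice'})-[:FRIENDS_WITH]->(f:Person) RETURN f.name AS friend");
            while (result.hasNext()) {
                Record record = result.next();
                System.out.println(record.get("friend").asString());
            }
        }
    }
}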
Unit-3
Question 1: (Remembering) What are MapReduce workflows, and how do they enable
distributed data processing?
Answer: MapReduce workflows are programming models used for processing large datasets in a
distributed computing environment. They consist of two main steps: Map and Reduce. The Map
step processes input data and generates key-value pairs as intermediate outputs. The Reduce step
then aggregates and summarizes the intermediate results based on the common keys.
Map: Input data is divided into smaller splits, and each split is processed independently by individual Mapper tasks.
Shuffle and Sort: The intermediate key-value pairs generated by the Mappers are sorted and grouped based on the keys before being passed to the Reducer tasks.
Reduce: The Reducer tasks aggregate and process the grouped data, producing the final output.
MapReduce workflows enable distributed data processing by leveraging the parallel processing
capabilities of a large cluster of nodes, allowing for efficient analysis of massive datasets.
Question 2: (Understanding) What is MRUnit, and how does it support testing of MapReduce applications?
Answer: MRUnit is a testing framework that allows developers to perform unit tests on MapReduce applications without the need for a full Hadoop cluster. It provides an environment to simulate MapReduce job execution locally.
Isolated Testing Environment: MRUnit creates an isolated testing environment, ensuring that test results are consistent and reproducible.
Easy Validation of Output: Developers can validate the output of Mapper and Reducer tasks easily, allowing for quick bug identification.
MRUnit empowers developers to catch errors early in the development process, ensuring the
correctness and robustness of their MapReduce applications.
Question 3: (Applying) Describe the anatomy of a MapReduce job run in a Hadoop cluster.
Answer: The execution of a MapReduce job in a Hadoop cluster involves several stages and
components that work together to process data efficiently.
Job Submission: The user submits the MapReduce job to the Hadoop cluster using the Hadoop JobClient or the YARN ResourceManager.
Map Phase: Input data is divided into splits, and Mapper tasks process these splits independently. Intermediate key-value pairs are generated as outputs.
Shuffle and Sort Phase: Intermediate outputs from the Mappers are sorted and grouped based on their keys before being passed to the Reducer tasks.
Reduce Phase: Reducer tasks process the sorted and grouped data, aggregating and producing the final output.
Job Completion: Once all tasks are completed, the JobTracker or ResourceManager marks the job as successful or failed, and the output is stored in HDFS or the specified output location.
Understanding the various stages and components involved in a MapReduce job run is essential
for optimizing performance and troubleshooting any issues that may arise during job execution.
Question 4: (Analyzing) Compare classic MapReduce (MRv1) and YARN as resource management architectures in Hadoop.
Answer: Classic MapReduce and YARN (Yet Another Resource Negotiator) are two different resource management architectures in Hadoop, serving distinct purposes in handling data processing tasks.
Support for Other Processing Models: Classic MapReduce has limited support for other processing models, whereas YARN is extensible and supports multiple processing models.
Question 5: (Evaluating) Assess the impact of failures in classic MapReduce and YARN on
job execution and data processing.
Answer: Failures in classic MapReduce and YARN can have significant implications for job
execution and data processing tasks.
Impact of Failures:
Classic MapReduce: A failure in the JobTracker can result in the entire job being halted, leading to significant delays and possible data loss.
Failures in classic MapReduce can result in job failures and potential data loss, while YARN's
distributed architecture provides better fault tolerance and job recovery capabilities, reducing the
impact of failures on data processing tasks.
Question 6: (Creating) Design a job scheduling strategy for a Hadoop cluster to optimize
resource utilization.
Answer: A well-designed job scheduling strategy in a Hadoop cluster can enhance resource
utilization and overall cluster efficiency.
Key components of such a strategy include: selection of a YARN scheduler (Capacity or Fair) suited to the workload mix, queues with defined capacity limits per team or application type, job priorities with preemption policies, and continuous monitoring of cluster utilization so that queue capacities can be rebalanced.
Question 7: (Creating) Develop an algorithm depicting the steps involved in the shuffle and
sort phase of a MapReduce job.
Answer:
The shuffle and sort phase in a MapReduce job involves the movement of intermediate key-value pairs from Mappers to Reducers. It includes the following steps: (1) partition the Mapper output by key so that each key is assigned to exactly one Reducer; (2) sort the intermediate key-value pairs within each partition on the map side; (3) copy (shuffle) the partitions over the network to the assigned Reducer nodes; (4) merge the sorted runs on the reduce side and group values by key before invoking the Reducer.
The shuffle and sort phase is a critical step in the MapReduce process, ensuring that relevant data
is grouped, sorted, and sent to the appropriate Reducers for further processing.
Question 8: (Evaluating) Evaluate the significance of input formats and output formats in
MapReduce jobs.
Answer: Input formats and output formats play a crucial role in defining how data is read from
input sources and written to output destinations in MapReduce jobs.
Significance of Input Formats:
Data Readability: Input formats determine how data is read from the input source (e.g., HDFS, databases). Properly chosen input formats ensure data readability and integrity during processing.
Data Splitting and Distribution: Input formats enable splitting data into manageable splits that are processed in parallel by Mappers, leading to efficient data distribution and processing.
Significance of Output Formats:
Data Write Flexibility: Output formats determine how the results are written to the output destination (e.g., HDFS, databases). Different formats cater to different use cases and applications.
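A minimal Java sketch of where these formats are declared when a job is configured; it relies on the default (identity) Mapper and Reducer so the focus stays on the format classes, and the input and output paths are taken from the command line.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class FormatConfigExample {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "format demo");
        job.setJarByClass(FormatConfigExample.class);

        // Default (identity) Mapper/Reducer are used, so this job simply copies its input;
        // the point here is only how input and output formats are declared.
        job.setOutputKeyClass(LongWritable.class);
        job.setOutputValueClass(Text.class);

        // Input format: controls how records are read and how splits are created
        job.setInputFormatClass(TextInputFormat.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));

        // Output format: controls how results are written to the destination
        job.setOutputFormatClass(TextOutputFormat.class);
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}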
9. Explain how MapReduce serves as an abstraction for parallel data processing in a Hadoop cluster.
Key Points:
MapReduce Overview: Programming model for processing large data sets in parallel.
Abstraction:
o Hides details of parallelization, fault tolerance, data distribution, and load
balancing.
Working:
o Map phase: Processes input data and emits key-value pairs.
o Reduce phase: Aggregates results from Map phase.
Efficiency Features:
o Automatic handling of data partitioning and task scheduling.
o Fault tolerance through re-execution of failed tasks.
Cluster-based Processing:
o Uses HDFS for storage, tasks distributed across nodes.
Example: Word count or log analysis using MapReduce.
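A minimal Java sketch of the word-count example mentioned above, using the standard Hadoop MapReduce API; input and output paths are taken from the command line.
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map phase: emit (word, 1) for every token in a line
    public static class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce phase: sum the counts for each word
    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(SumReducer.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}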
10. Describe shuffling and sorting technique in MapReduce with example. (NOV/DEC
2024)
Key Points:
Shuffling:
o Process of transferring Map output to the appropriate Reducers.
o Ensures all values for the same key go to the same reducer.
Sorting:
o Map output is sorted by keys before being fed to the reducer.
o Helps in grouping data efficiently.
Importance: Required for grouping phase in reducers.
Example:
o Word Count: All occurrences of "Hadoop" grouped and sent to one reducer.
Diagram:
o Show flow: Map → Shuffle → Sort → Reduce
11. What is unit testing? Explain the steps followed in unit test with MRUnit. (APR/MAY
2025)
Key Points:
Unit Testing:
o Testing individual components (units) of software.
o In MapReduce, test logic in Mapper/Reducer classes.
MRUnit:
o A Java-based testing framework for MapReduce unit testing.
o Allows testing without running on full Hadoop cluster.
Steps:
1. Add MRUnit JAR to project.
2. Define test data (input/output).
3. Instantiate MapDriver, ReduceDriver, or MapReduceDriver.
4. Set input/output values.
5. Call .runTest() and verify output.
Example: Unit test for Mapper that emits (word, 1)
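A minimal sketch of such an MRUnit test, assuming MRUnit and JUnit 4 are on the classpath; TokenizerMapper refers to the (word, 1) Mapper from the word-count sketch under key point 9.
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mrunit.mapreduce.MapDriver;
import org.junit.Before;
import org.junit.Test;

public class WordCountMapperTest {
    private MapDriver<LongWritable, Text, Text, IntWritable> mapDriver;

    @Before
    public void setUp() {
        // TokenizerMapper is the (word, 1) Mapper from the word-count sketch above
        mapDriver = MapDriver.newMapDriver(new WordCount.TokenizerMapper());
    }

    @Test
    public void mapperEmitsWordAndOne() throws IOException {
        mapDriver.withInput(new LongWritable(0), new Text("hadoop spark"))
                 .withOutput(new Text("hadoop"), new IntWritable(1))
                 .withOutput(new Text("spark"), new IntWritable(1))
                 .runTest();
    }
}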
12. Explain various job scheduling algorithms followed in YARN job scheduling with clear
examples. (APR/MAY 2025)
Key Points:
YARN Schedulers:
1. FIFO Scheduler:
Jobs executed in order of submission.
Simple but not fair in resource sharing.
Use case: Small, single-user environments.
2. Capacity Scheduler:
Divides cluster into queues with defined capacities.
Supports multi-tenancy.
Use case: Organizations with resource partitions per team.
3. Fair Scheduler:
Resources distributed equally among jobs over time.
Jobs get fair share, avoids starvation.
Use case: Shared clusters with interactive and batch jobs.
Examples: Multiple users running jobs; how Fair scheduler assigns equal resources over
time.
13. Explain the architecture of YARN and discuss how it manages resources and job
scheduling in a Hadoop cluster. (NOV/DEC 2024)
Key Points:
YARN (Yet Another Resource Negotiator):
o Decouples resource management and job scheduling from MapReduce.
Architecture Components:
1. ResourceManager (RM):
Global scheduler and arbitrator.
2. NodeManager (NM):
Manages resources on individual nodes.
3. ApplicationMaster (AM):
Handles job-specific logic.
4. Container:
Execution environment with allocated resources.
Job Scheduling:
o RM allocates containers to AMs.
o AM requests further containers to run tasks.
Diagram: Show ResourceManager, NodeManagers, ApplicationMaster, Containers
14. i) Explain the architecture of Apache Hadoop YARN and its significance in the Hadoop
ecosystem. (NOV/DEC 2024)
Key Points:
Repeat of Q13 with emphasis on ecosystem benefits:
o Supports multiple frameworks (MapReduce, Spark, Tez, etc.)
o Better scalability, resource utilization, and multi-tenancy
14. ii) Explain how a MapReduce job runs on YARN. (NOV/DEC 2024)
Key Points:
Job Execution Flow:
1. Job submission by client to ResourceManager.
2. ResourceManager starts ApplicationMaster in a container.
3. ApplicationMaster negotiates more containers for Map/Reduce tasks.
4. Tasks run inside containers managed by NodeManagers.
5. Status updates, job completion reported to client.
Key Concepts:
o ApplicationMaster lifecycle
o MapTask and ReduceTask handled as independent containers
o Fault tolerance via retry and rescheduling
Unit-4
Question 1: (Remembering) What is Hadoop Streaming, and how does it enable data
processing with non-Java programs in Hadoop?
Answer: Hadoop Streaming is a utility in Hadoop that enables data processing with non-Java
programs. It allows developers to use any programming language that can read from standard
input and write to standard output as Mapper and Reducer functions in MapReduce jobs.
Advantages of Hadoop Streaming include language flexibility (any language that can read standard input and write standard output), rapid prototyping with scripting languages such as Python, and reuse of existing scripts without rewriting them in Java.
Hadoop Streaming is particularly useful when specialized processing tasks require languages
other than Java, making it a versatile tool for data processing in Hadoop.
Question 2: (Understanding) How does Hadoop Pipes facilitate the integration of C++
programs with Hadoop?
Answer: Hadoop Pipes is a C++ API that enables the integration of C++ programs with Hadoop.
It allows developers to create Mappers and Reducers using C++ programming language,
providing an alternative to Java for data processing in Hadoop.
Advantages of Hadoop Pipes:
Existing C++ Code Reuse: Organizations with existing C++ codebases can reuse their libraries and algorithms in Hadoop, saving development time and effort.
Hadoop Pipes is an excellent choice for organizations with C++ expertise, allowing them to
leverage their existing codebase for data processing in Hadoop.
Question 3: (Applying) Describe the design of the Hadoop Distributed File System (HDFS)
and its key features.
Answer: The Hadoop Distributed File System (HDFS) is the storage layer of the Hadoop
ecosystem, designed to handle massive datasets distributed across a cluster of commodity
hardware.
Table: Key Features of Hadoop Distributed File System (HDFS)
Write-Once-Read-Many (WORM) Model: Data in HDFS is typically written once and read multiple times, making it suitable for batch processing.
The design of HDFS enables efficient and reliable storage and retrieval of large-scale data,
making it the backbone of many big data applications.
Question 4: (Analyzing) Compare Hadoop I/O methods - Local I/O and HDFS I/O, and
their impact on data processing in Hadoop.
Answer: Hadoop supports two primary I/O methods: Local I/O, which deals with data on the
local file system, and HDFS I/O, which involves reading and writing data to and from the
Hadoop Distributed File System (HDFS).
Data Storage and Replication: Local I/O stores data on a single node and lacks data replication for fault tolerance, whereas HDFS I/O stores data across multiple nodes with replication for fault tolerance.
Data Movement and Data Access: Local I/O moves data to and from a single node, potentially leading to data movement bottlenecks, whereas HDFS I/O lets tasks access data locally on each node, reducing data movement overhead.
Fault Tolerance: Local I/O lacks inherent fault tolerance features, whereas HDFS I/O provides built-in data replication for fault tolerance and data availability.
HDFS I/O outperforms Local I/O in Hadoop environments by providing distributed storage, fault
tolerance, and scalability, enabling efficient data processing in large-scale distributed systems.
Question 5: (Evaluating) Assess the significance of data integrity in Hadoop and its impact
on data quality and reliability.
Answer: Data integrity is a critical aspect of Hadoop data processing, ensuring data quality and
reliability throughout the data lifecycle.
Impact of Data Integrity:
Data Accuracy and Quality: Ensuring data integrity guarantees the accuracy and reliability of analytical results derived from Hadoop data processing.
Preventing Data Corruption: Data integrity mechanisms like checksums and replication prevent data corruption during storage and transmission.
Compliance and Data Governance: Data integrity is essential for maintaining compliance with regulatory requirements and data governance policies.
Data integrity is fundamental in Hadoop to preserve the trustworthiness of data, prevent data
corruption, and foster confidence in the analytical insights derived from big data processing.
Question 6: (Creating) Design a data compression strategy for Hadoop to optimize storage
and processing efficiency.
Answer: A data compression strategy in Hadoop involves compressing input data for storage and
decompressing it during processing, optimizing storage space and processing efficiency.
Input Data Compression: Compress input data before storing it in HDFS to reduce storage space requirements.
Output Data Compression: Compress output data generated by MapReduce jobs to minimize data transfer and storage costs.
A well-designed data compression strategy in Hadoop optimizes storage utilization and reduces
data transfer costs, enhancing overall performance and cost-efficiency.
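A minimal Java sketch of how such a strategy is switched on when configuring a MapReduce job; Gzip is used purely for illustration (a splittable or faster codec such as Snappy may be preferable in practice), and the remaining job settings are omitted.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CompressionConfigExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Compress intermediate map output to cut shuffle traffic
        conf.setBoolean("mapreduce.map.output.compress", true);
        conf.setClass("mapreduce.map.output.compress.codec",
                      GzipCodec.class, CompressionCodec.class);

        Job job = Job.getInstance(conf, "compression demo");

        // Compress the final job output written by the OutputFormat
        FileOutputFormat.setCompressOutput(job, true);
        FileOutputFormat.setOutputCompressorClass(job, GzipCodec.class);

        // ... set mapper/reducer, input and output paths as usual, then submit the job
    }
}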
Question 7: (Creating) Explain the concept of Avro serialization and its advantages in
Hadoop.
Answer: Avro is a data serialization system that allows for efficient and compact data storage
and exchange between programs in Hadoop.
Advantages of Avro Serialization:
Schema Evolution: Reader and writer schemas can differ, allowing data structures to evolve without breaking existing consumers.
Compact Binary Encoding: Binary-encoded records are smaller and faster to read and write than text-based formats.
Language Independence: Data serialized by a program in one language can be read by programs written in other languages.
Avro's schema evolution capabilities, compact binary encoding, and language independence
make it an ideal choice for data serialization in Hadoop, facilitating efficient data processing and
data interchange between applications.
Question 8: (Evaluating) Evaluate the integration of Cassandra with Hadoop and its
significance in big data analytics.
Answer: The integration of Cassandra with Hadoop combines the strengths of both systems,
enabling efficient big data analytics and real-time data processing.
Significance of the Integration:
Batch Analytics over Operational Data: MapReduce, Hive, or Pig jobs can analyze data stored in Cassandra without migrating it to a separate store.
Hybrid Workloads: Fast operational reads and writes are served by Cassandra while Hadoop handles large-scale batch computation.
Real-time plus Historical Insight: Combining the two systems supports both real-time access and deep analytical processing of the same data.
9. Explain the role of Java interfaces in HDFS. Discuss how these interfaces are used to
interact with HDFS and describe their key functions and methods. (APR/MAY 2025)
Key Points:
Role of Java Interfaces in HDFS:
o Java interfaces provide APIs for clients and developers to interact with HDFS
programmatically.
o Enable file operations, directory manipulation, and metadata handling.
Key Interfaces and Classes:
o FileSystem (abstract class): Base class for all Hadoop file systems.
o DistributedFileSystem: Implementation of HDFS in Java.
o FSDataInputStream, FSDataOutputStream: For reading and writing data.
o Path: Represents file/directory paths.
Common Methods:
fs.open(new Path("/file.txt"));
fs.create(new Path("/file.txt"));
fs.delete(new Path("/file.txt"), true);
fs.listStatus(new Path("/"));
Usage Example:
o Java program using Hadoop API to read and write a file to HDFS.
o Explain Configuration conf = new Configuration(); FileSystem fs =
FileSystem.get(conf);
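Pulling the calls above together, a minimal Java program that writes and then reads a small file in HDFS; the path /demo/hello.txt is an assumption for illustration.
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadWrite {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();      // picks up core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("/demo/hello.txt");       // assumed path

        // Write: create (or overwrite) the file and write one line
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.write("Hello HDFS\n".getBytes(StandardCharsets.UTF_8));
        }

        // Read: open the file and print its contents
        try (FSDataInputStream in = fs.open(file);
             BufferedReader reader = new BufferedReader(
                     new InputStreamReader(in, StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        }

        fs.close();
    }
}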
10. Discuss the advantage of using Apache Avro over other Hadoop ecosystem tools.
Highlight Avro’s compact binary encoding, schema evolution support and integration with
Hadoop ecosystem tools. (NOV/DEC 2024)
Key Points:
What is Avro:
o A row-oriented remote procedure call and data serialization framework.
o Designed for efficient serialization, particularly in Hadoop environments.
Advantages:
o Compact binary encoding:
Smaller file size and faster data transmission
Better suited for large-scale data storage
o Schema Evolution:
Supports backward and forward compatibility
Reader and writer schemas can differ as long as rules are followed
o Integration with Hadoop Ecosystem:
Works with Hive, Pig, MapReduce, and Kafka
Supports splittable files – better parallel processing in MapReduce
Use Case Example:
o Storing user logs in Avro format, processing with Pig or Hive
11. Define HDFS. Describe NameNode, DataNode, and Block. Explain HDFS operations in
detail. (NOV/DEC 2024)
Key Points:
Definition:
o HDFS: Hadoop Distributed File System, designed for fault-tolerant storage of
large files across clusters.
Components:
o NameNode:
Master node
Stores metadata (file names, permissions, block locations)
o DataNode:
Slave node
Stores actual data blocks
o Block:
Unit of storage (default 128 MB)
Files are split into blocks for distributed storage
HDFS Operations:
1. Read Operation:
Client requests file → NameNode gives block locations → Client reads
directly from DataNodes
2. Write Operation:
Client contacts NameNode → Data blocks are written to multiple
DataNodes (replication)
3. Replication and Fault Tolerance:
Default replication factor = 3
Diagram:
o Show NameNode, DataNodes, client, block flow during read/write
12. With necessary examples, explain how serialization and deserialization of data is done.
(APR/MAY 2025)
Key Points:
Definition:
o Serialization: Converting object into a stream of bytes
o Deserialization: Reconstructing object from byte stream
In Hadoop:
o Important for data transmission between nodes
o Used in MapReduce, HDFS, Avro, Thrift, Protocol Buffers
Example using Java Writable:
import java.io.*;
import org.apache.hadoop.io.*;

public class Student implements Writable {
    private IntWritable id = new IntWritable();    // initialize so readFields can populate them
    private Text name = new Text();

    public void write(DataOutput out) throws IOException {      // serialization
        id.write(out);
        name.write(out);
    }

    public void readFields(DataInput in) throws IOException {   // deserialization (same field order)
        id.readFields(in);
        name.readFields(in);
    }
}
Using Avro (Example):
o Define schema → serialize with DatumWriter → deserialize with DatumReader
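A minimal Java sketch of the Avro steps listed above, using a GenericRecord with an inline schema; the Student schema and the file name students.avro are illustrative assumptions.
import java.io.File;
import org.apache.avro.Schema;
import org.apache.avro.file.DataFileReader;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.DatumReader;
import org.apache.avro.io.DatumWriter;

public class AvroExample {
    public static void main(String[] args) throws Exception {
        String schemaJson = "{\"type\":\"record\",\"name\":\"Student\","
                + "\"fields\":[{\"name\":\"id\",\"type\":\"int\"},{\"name\":\"name\",\"type\":\"string\"}]}";
        Schema schema = new Schema.Parser().parse(schemaJson);

        // Serialize: write one record to an Avro container file
        GenericRecord rec = new GenericData.Record(schema);
        rec.put("id", 1);
        rec.put("name", "John");
        DatumWriter<GenericRecord> writer = new GenericDatumWriter<>(schema);
        try (DataFileWriter<GenericRecord> fileWriter = new DataFileWriter<>(writer)) {
            fileWriter.create(schema, new File("students.avro"));
            fileWriter.append(rec);
        }

        // Deserialize: read the records back with a DatumReader
        DatumReader<GenericRecord> reader = new GenericDatumReader<>(schema);
        try (DataFileReader<GenericRecord> fileReader =
                     new DataFileReader<>(new File("students.avro"), reader)) {
            while (fileReader.hasNext()) {
                System.out.println(fileReader.next());
            }
        }
    }
}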
13. Evaluate the integration of Cassandra with Hadoop and its significance in big data
analytics. (NOV/DEC 2024)
Key Points:
Why Integrate Cassandra with Hadoop:
o Cassandra: High-availability, NoSQL wide-column store
o Hadoop: Distributed processing
o Integration helps query Cassandra data using MapReduce, Hive, Pig
Methods of Integration:
o Hadoop-Cassandra Connector: Enables read/write between HDFS and
Cassandra
o MapReduce with CQL (Cassandra Query Language)
o Tools: CassandraBulkRecordWriter, Hive-Cassandra integration
Significance in Analytics:
o Run batch analytics on Cassandra data without migration
o Use Hadoop for complex computation over data stored in Cassandra
o Supports hybrid workloads – fast OLTP (Cassandra) + batch analytics (Hadoop)
Example Use Case:
o IoT sensor data stored in Cassandra, analyzed with Hive or Spark on Hadoop
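As a minimal sketch of the Cassandra side of such a pipeline (not the Hadoop connector itself), the following Java program reads rows with the DataStax Java driver; the contact point, datacenter name, keyspace iot, and table readings are assumptions for illustration.
import java.net.InetSocketAddress;
import com.datastax.oss.driver.api.core.CqlSession;
import com.datastax.oss.driver.api.core.cql.ResultSet;
import com.datastax.oss.driver.api.core.cql.Row;

public class SensorReader {
    public static void main(String[] args) {
        // Contact point, datacenter, and keyspace are assumptions for illustration
        try (CqlSession session = CqlSession.builder()
                .addContactPoint(new InetSocketAddress("127.0.0.1", 9042))
                .withLocalDatacenter("datacenter1")
                .withKeyspace("iot")
                .build()) {
            ResultSet rs = session.execute("SELECT sensor_id, value FROM readings LIMIT 10");
            for (Row row : rs) {
                System.out.println(row.getString("sensor_id") + " " + row.getDouble("value"));
            }
        }
    }
}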
Unit-5
Question 1: (Remembering) What is HBase, and how does its data model differ from
traditional relational databases?
Answer: HBase is a distributed, scalable, and column-oriented NoSQL database built on top of
Hadoop Distributed File System (HDFS). It follows the Bigtable data model, which is different
from traditional relational databases.
Table: Comparison between HBase Data Model and Traditional Relational Database Model
Schema: HBase is schema-flexible (column families are fixed, but columns can vary from row to row), whereas relational databases enforce a fixed table schema.
Orientation: HBase is column-oriented and stores sparse data efficiently; relational databases are row-oriented.
Scalability: HBase scales horizontally across commodity servers on top of HDFS; relational databases typically scale vertically.
Querying: HBase provides key-based access without joins or full SQL; relational databases support SQL with joins and ACID transactions.
HBase's data model and distributed architecture make it ideal for handling large-scale, real-time,
and high-throughput data scenarios.
Question 2: (Understanding) How do HBase clients interact with the HBase database, and
what are the different types of HBase clients?
Answer: HBase clients interact with the HBase database to perform read and write operations on
data. There are mainly two types of HBase clients: Java-based clients and RESTful clients.
Java-based Clients: Java clients interact with HBase using the HBase Java API. They provide extensive control over HBase operations and are suitable for Java-centric applications.
RESTful Clients: RESTful clients use HTTP methods to communicate with HBase via the HBase REST API. They offer language independence and are suitable for applications written in various programming languages.
HBase clients provide programmatic access to HBase data, allowing applications to read, write,
and manage data in the distributed database.
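A minimal Java sketch of a Java-based client performing a put and a get, assuming an HBase 2.x-style client API; the table users and the column family info are assumptions for illustration.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseClientExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();   // reads hbase-site.xml from the classpath
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("users"))) {

            // Write: one row with a single column in the 'info' family
            Put put = new Put(Bytes.toBytes("row1"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("John"));
            table.put(put);

            // Read the row back
            Get get = new Get(Bytes.toBytes("row1"));
            Result result = table.get(get);
            byte[] value = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"));
            System.out.println("name = " + Bytes.toString(value));
        }
    }
}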
Question 3: (Applying) Provide examples of typical use cases for HBase and illustrate how
its data model supports them.
Answer: HBase is well-suited for various use cases due to its distributed, column-oriented data
model. Here are some examples:
Social Media and Recommendations: HBase's schemaless nature enables flexible data modeling, making it suitable for social media data and personalized recommendations.
HBase's data model provides the necessary flexibility and scalability for a wide range of use
cases, making it a popular choice for big data applications.
Question 4: (Analyzing) Compare praxis.Pig and Grunt in Apache Pig, focusing on their
roles in data processing.
Answer: praxis.Pig and Grunt are two modes of interacting with Apache Pig, a high-level
platform for processing and analyzing large datasets in Hadoop.
Ease of Use: praxis.Pig simplifies the development process for users who prefer a graphical interface and have limited knowledge of Pig Latin scripting, while Grunt offers full flexibility and control over Pig operations, making it suitable for experienced users and complex data processing tasks.
Learning Curve: praxis.Pig has a gentle learning curve, allowing beginners to get started with Pig data processing quickly, whereas Grunt requires familiarity with Pig Latin and the command-line interface, which may mean a steeper learning curve for some users.
Both praxis.Pig and Grunt serve as interfaces for interacting with Apache Pig, catering to users
with different preferences and levels of expertise.
Question 5: (Evaluating) Assess the Pig data model and how it facilitates data processing
using Pig Latin scripts.
Answer: The Pig data model abstracts the complexities of data processing in Apache Pig,
providing a high-level interface for users to write data transformation and analysis using Pig
Latin scripts.
Table: Advantages of the Pig Data Model and Pig Latin Scripts
Data Flow Optimization: Pig Latin optimizes data flow automatically, allowing users to focus on data processing logic rather than implementation details.
Support for Complex Data Operations: Pig Latin supports complex data transformations, including joins, aggregations, and filtering, simplifying big data analytics.
The Pig data model and Pig Latin scripts enhance productivity, reduce development time, and
enable users to process large datasets with ease.
Question 6: (Creating) Design a Pig Latin script to analyze a dataset for sentiment analysis,
including data loading, processing, and storing results.
Answer: Assume we have a dataset containing user reviews with columns: review_id, user_id,
and review_text. We want to perform sentiment analysis on the review_text and store the results
in HDFS.
cleaned_data = FILTER tokenized_data BY word IS NOT NULL AND word MATCHES '\\w+';
-- Remove non-alphanumeric characters
A complete Pig Latin script along these lines loads the dataset, tokenizes and cleans the review text (as in the FILTER statement above), performs sentiment analysis, and stores the average sentiment scores per review and user in HDFS.
Question 7: (Creating) Develop a Pig Latin script to compute the total sales amount for
each product category from a sales dataset.
Answer: Assume we have a sales dataset with columns: product_id, product_name, category, and
sales_amount. We want to compute the total sales amount for each product category.
A Pig Latin script for this task loads the sales dataset, groups the data by category, calculates the total sales amount for each category, and stores the results in HDFS; the full script is shown under key point 10 below.
Question 8: (Evaluating) Assess the significance of Hive data types and file formats in data
processing tasks.
Answer: Hive data types and file formats play a crucial role in data processing tasks, providing
flexibility and optimization for various use cases.
Hive data types and file formats ensure data compatibility, performance optimization, and
schema flexibility, making Hive a powerful tool for big data processing in the Hadoop
ecosystem.
9. What is Hive? Explain about data definition and data manipulation in Hive. (Nov/Dec
2024)
Key Points:
Hive Overview:
o Data warehouse system built on top of Hadoop.
o Allows querying and managing large datasets using HiveQL (SQL-like
language).
Data Definition Language (DDL):
o Used to define schema and manage tables.
o Examples:
CREATE TABLE students (id INT, name STRING);
ALTER TABLE students ADD COLUMNS (age INT);
DROP TABLE students;
Data Manipulation Language (DML):
o Insert, update, delete data.
o Examples:
LOAD DATA INPATH '/user/data/students.csv' INTO TABLE students;
INSERT INTO TABLE students VALUES (1, 'John');
Execution: Hive queries are translated into MapReduce jobs.
10. Develop a Pig Latin script to compute the total sales amount for each product category
from a sales dataset. (Apr/May 2025)
Key Points:
Assume schema: product_id, category, price, quantity
Pig Latin Script:
sales = LOAD 'sales_data.csv' USING PigStorage(',')
    AS (product_id:chararray, category:chararray, price:float, quantity:int);
line_totals = FOREACH sales GENERATE category, (price * quantity) AS total;
grouped = GROUP line_totals BY category;
total_sales = FOREACH grouped GENERATE group AS category, SUM(line_totals.total) AS total_amount;
DUMP total_sales;
Explanation: Loading, calculating line item totals, grouping by category, summing
totals.
11. Illustrate the usage of ‘filters’, ‘group’, ‘orderBy’, ‘distinct’, and ‘load’ keywords in Pig
scripts. (Apr/May 2025)
Key Points:
LOAD:
A = LOAD 'students.csv' USING PigStorage(',') AS (id:int, name:chararray, marks:int);
FILTER:
B = FILTER A BY marks > 50;
GROUP:
C = GROUP B BY name;
ORDER BY:
D = ORDER B BY marks DESC;
DISTINCT:
E = DISTINCT A;
Explain with output examples showing results from each stage.
12. How will you query the data in Hive? Illustrate with data definition, data manipulation,
and data selection queries. (Nov/Dec 2024)
Key Points:
Data Definition (DDL):
CREATE TABLE employee (id INT, name STRING, salary FLOAT);
Data Manipulation (DML):
LOAD DATA INPATH '/data/emp.csv' INTO TABLE employee;
Data Selection (Querying):
SELECT * FROM employee;
SELECT name, salary FROM employee WHERE salary > 30000;
SELECT AVG(salary) FROM employee;
Hive Query Processing:
o Queries internally translated into MapReduce jobs or Tez/Spark jobs.
13.i) Explain the architecture of Apache HBase and its components with a neat diagram.
(Nov/Dec 2024)
Key Points:
Overview:
o Column-oriented NoSQL database on top of HDFS
o Designed for real-time read/write access to large datasets.
Components:
o HMaster: Coordinates region servers
o Region Server: Manages regions (subsets of table)
o Zookeeper: Coordinates cluster health and metadata
o HFile: Actual file storage in HDFS
o MemStore + WAL (Write Ahead Log)
Diagram:
o Include client → Zookeeper → HMaster → RegionServer → HFile flow.
13.ii) With a suitable example, discuss meta-store concepts in Hive and storage mechanisms
in HBase. (Nov/Dec 2024)
Key Points:
Hive Metastore:
o Central repository storing metadata: table schema, partitioning info, etc.
o Backed by RDBMS (MySQL, Derby)
Example:
o Table student(id, name) stored in /user/hive/warehouse/student/
o Metadata about the table stored in Metastore
HBase Storage Mechanisms:
o Data stored in HFiles (column-wise), indexed by row key
o Supports sparse data, write-optimized
Comparison:
o Hive → batch analytics
o HBase → real-time NoSQL access
14. Explain the detailed process of installing Apache Pig on a Hadoop environment. What
are the prerequisites, steps, and verification methods involved in Pig installation? (Apr/May
2025)
Key Points:
Prerequisites:
o Java installed
o Hadoop installed and running
o Environment variables set (JAVA_HOME, HADOOP_HOME)
Installation Steps:
1. Download Pig from Apache site
2. Extract Pig
3. Set PIG_HOME and update PATH
4. Link Pig with Hadoop
Verification:
o Run Pig in local or mapreduce mode:
pig -x local
or
pig -x mapreduce
Run Sample Script to verify successful execution
15. Provide examples of typical use cases for HBase and illustrate how its data model
supports them. (Apr/May 2025)
Key Points:
Use Cases:
1. Time-series data: IoT sensors, server logs
2. Real-time analytics: Clickstream, fraud detection
3. Large-scale key-value store: Social media, messaging apps
4. Metadata storage: For Hadoop jobs or big data platforms
Data Model Support:
o Row key: fast lookups
o Column family: logical data separation
o Cell versioning: supports historical data
o Schema flexibility
Example:
o IoT system: row key = sensor ID, column = timestamp:value