
CS334 - Big Data Analytics - Part - B Questions

Unit-1

Question 1: (Remembering) What is big data, and how does it relate to the convergence of
key trends in technology and business?

Answer: Big data refers to the vast volume of structured and unstructured data that organizations
generate and collect. It encompasses three key characteristics: volume, velocity, and variety. The
convergence of big data with key trends in technology and business has led to transformative
changes in various industries.

Table: Key Trends Converging with Big Data

Key Trends | Description
Advanced Analytics | Utilizing sophisticated techniques for data analysis, including predictive modeling and machine learning.
Internet of Things (IoT) | Connecting devices and collecting real-time data, generating massive data streams.
Cloud Computing | Providing scalable and cost-effective infrastructure for storing and processing big data.
Mobile Business Intelligence | Delivering data insights to mobile devices, enabling on-the-go decision-making.

The convergence of these trends has empowered organizations to extract valuable insights from
big data, enhance operational efficiency, and improve customer experiences, making it a pivotal
aspect of modern business strategies.

Question 2: (Understanding) Explain the concept of unstructured data and its significance
in the context of big data.

Answer: Unstructured data refers to data that lacks a predefined data model or schema. It
includes textual content, images, videos, social media posts, and more. In the context of big data,
unstructured data is significant because it constitutes a substantial portion of the data generated
daily.

Table: Examples of Unstructured Data

Data Type | Description
Text Documents | Email content, articles, reports, and more.
Images | Pictures, logos, diagrams, and visual content.
Videos | Multimedia content, recordings, and clips.


Unstructured data presents challenges for traditional databases, which are not optimized to
handle such diverse data formats. However, big data technologies like Hadoop and NoSQL
databases offer solutions to effectively store, process, and analyze unstructured data. Extracting
insights from unstructured data enables organizations to understand customer sentiments,
analyze social media trends, and make data-driven decisions.

Question 3: (Applying) Provide industry examples of big data applications and their
impact on business outcomes.

Answer: Big data applications have revolutionized industries, driving data-driven decision-
making and optimizing various business processes. Let's explore some industry examples and
their impact.

Table: Industry Examples of Big Data Applications and Impact

Industry | Big Data Application | Impact on Business Outcomes
Retail | Customer Analytics | Personalized marketing and improved inventory management.
Healthcare | Predictive Analytics and Personalized Medicine | Enhanced patient care and treatment outcomes.
Finance | Fraud Detection and Risk Assessment | Improved security and better lending decisions.
Marketing | Web Analytics and Customer Behavior Analysis | Targeted marketing and improved customer engagement.

Big data applications empower organizations to analyze vast datasets, gain actionable insights,
and drive business growth by enhancing customer satisfaction and operational efficiency.

Question 4: (Analyzing) Evaluate the role of Hadoop in handling big data and its
advantages for businesses.

Answer: Hadoop plays a pivotal role in handling big data, providing a scalable and cost-effective
solution for data storage and processing.

Table: Advantages of Hadoop in Handling Big Data

Advantages of Hadoop | Description
Scalability | Hadoop's distributed architecture enables horizontal scaling, accommodating large and growing datasets.
Fault Tolerance | Hadoop replicates data across nodes, ensuring data availability even in the event of node failures.
Cost-effectiveness | Hadoop runs on commodity hardware, reducing infrastructure costs compared to proprietary systems.
Flexibility | Hadoop supports diverse data types and formats, including structured and unstructured data.
Parallel Processing | Hadoop's MapReduce model enables parallel processing, improving data processing efficiency.

By leveraging Hadoop, businesses can handle massive datasets effectively, gain valuable
insights, and accelerate data-driven decision-making.

Question 5: (Evaluating) Assess the significance of open-source technologies in the context of big data analytics.

Answer: Open-source technologies have had a profound impact on the big data analytics
landscape, offering a range of benefits for organizations.

Table: Significance of Open-source Technologies in Big Data Analytics

Significance of Open-source Technologies | Description
Accessibility | Open-source tools are freely available, enabling organizations of all sizes to access advanced data analytics capabilities.
Innovation and Collaboration | Open-source projects foster collaboration among developers, leading to continuous innovation and rapid advancements.
Community Support | Robust open-source communities provide extensive documentation and support for users to resolve issues effectively.
Customizability | Organizations can tailor open-source tools to meet their specific needs and integrate them seamlessly with existing systems.
Interoperability | Open-source technologies promote interoperability, allowing different tools to work together cohesively.

The significance of open-source technologies in big data analytics lies in their ability to
democratize access to powerful tools, foster innovation, and empower organizations to harness
the full potential of big data analytics.

Question 6: (Creating) Design a crowd-sourcing analytics project and its application in a specific domain.

Answer: Designing a crowd-sourcing analytics project involves engaging a diverse group of individuals to contribute data insights in a collaborative manner. Let's explore an example in the domain of environmental monitoring.

Table: Crowd-sourcing Analytics Project in Environmental Monitoring


Project Goal | Description
Biodiversity Mapping | Engage citizen scientists to report wildlife sightings, plant species, and environmental observations.
Data Collection and Verification | Establish a mobile app or website for users to submit photos, GPS coordinates, and data.
Data Validation and Quality Assurance | Implement a verification process to validate submitted data for accuracy and reliability.
Data Visualization and Analysis | Use crowd-sourced data to create interactive maps and reports to monitor biodiversity trends.
Impact and Community Engagement | Share insights and findings with participants, fostering a sense of ownership and community engagement.

By harnessing the power of crowd-sourced data, this project promotes environmental conservation and biodiversity research, creating a collaborative platform for gathering and analyzing crucial environmental data.
Question 7: (Creating) Develop a plan for inter and trans firewall analytics implementation
for a company's data security.

Answer: Implementing inter and trans firewall analytics is crucial for enhancing data security in
distributed systems. Let's outline the plan for a company:

Table: Plan for Inter and Trans Firewall Analytics Implementation

Step | Description
Network Segmentation | Identify network segments and set up inter firewall rules to enforce segmentation, limiting data access between segments.
Firewall Deployment | Deploy trans firewalls at strategic points to monitor and analyze network traffic for potential threats and security breaches.
Data Access Control | Implement access control policies based on roles and permissions to ensure authorized access to sensitive data.
Anomaly Detection | Set up anomaly detection mechanisms to identify suspicious activities and patterns in real time.
Incident Response and Remediation | Develop incident response procedures to handle security incidents swiftly and effectively.

By following this plan, the company can strengthen its data security, mitigate risks, and protect
critical data assets from unauthorized access and cyber threats.

Question 8: (Evaluating) Assess the impact of web analytics in big data applications and its
significance for digital marketing.

Answer: Web analytics plays a vital role in big data applications, enabling organizations to gain
insights from website data and optimize digital marketing strategies.

Table: Impact of Web Analytics in Big Data Applications

Impact of Web Analytics | Description
Customer Behavior Analysis | Web analytics tracks user behavior, interactions, and preferences, providing insights into customer journeys and experiences.
Personalized Marketing | Insights from web analytics enable targeted marketing campaigns, tailored to specific audience segments.
Campaign Performance Tracking | Analyzing web data helps measure the effectiveness of marketing campaigns, allowing continuous improvement.
Real-time Decision-making | Web analytics provides real-time data, empowering organizations to make data-driven decisions on the spot.

Web analytics has become a cornerstone of digital marketing, helping businesses understand
customer behavior, improve user experiences, and optimize marketing efforts for higher
engagement and conversion rates. Its impact on big data applications allows organizations to
adapt and thrive in the dynamic digital landscape.

9. Explain how big data is applied in fraud detection and health care analysis. (MAY/JUNE
2025)
Key Points:
 Fraud Detection:
o Real-time data processing from various sources (transactions, logs, etc.)
o Pattern recognition and anomaly detection using ML models
o Predictive modeling to flag suspicious activities
o Case: Credit card fraud detection
 Health Care Analysis:
o Patient data analytics for diagnosis and treatment personalization
o Predictive analytics for epidemic outbreaks
o Integration of EHR (Electronic Health Records)
o Case: Wearable health tech for remote monitoring

10.i) Discuss about few open-source technologies in Big Data. (NOV/DEC 2024)
Key Points:
 Apache Hadoop: Distributed storage and processing (HDFS + MapReduce)
 Apache Spark: Fast in-memory data processing engine
 Apache Hive: SQL-like interface on Hadoop
 Apache Flink / Storm: Real-time stream processing
 Apache Kafka: Distributed messaging system
10.ii) How cloud and big data is related to each other? Explain. (NOV/DEC 2024)
Key Points:
 Cloud provides scalable infrastructure for Big Data storage and processing
 Cloud-based tools (AWS EMR, Google BigQuery, Azure HDInsight)
 Cost efficiency: pay-as-you-use model
 Integration and flexibility for real-time processing
 Big Data workloads benefit from cloud elasticity and distributed architecture

11. Explain Big Data Analytics and its characteristics. (NOV/DEC 2024)
Key Points:
 Definition of Big Data Analytics: Process of examining large datasets to uncover hidden
patterns
 Types: Descriptive, Diagnostic, Predictive, Prescriptive
 Characteristics (5 Vs):
o Volume
o Velocity
o Variety
o Veracity
o Value

12. Explain Hadoop Architecture and its components with proper diagram. (NOV/DEC
2024)
Key Points:
 Hadoop Core Components:
o HDFS (NameNode, DataNode)
o MapReduce (JobTracker, TaskTracker)
 YARN Architecture (ResourceManager, NodeManager)
 Workflow: Data ingestion → storage → processing
 Diagram: Clearly show interaction between components

13. Explain how big data is applied for Inter and Trans-firewall analysis. (MAY/JUNE
2025 – 13 Marks Key)
Key Answer Points (Structure it for 13 Marks):
Introduction (2 Marks):
 Define firewall analytics and differentiate inter-firewall (between firewalls) and trans-
firewall (across firewalls and networks).
 Role of big data in cybersecurity and intrusion detection.
Big Data in Firewall Analytics (4 Marks):
 Collection of massive logs from multiple firewalls.
 Use of real-time stream processing (Apache Kafka, Flink).
 Pattern detection and correlation of network events.
 Identification of anomalies, lateral movements, port scans, etc.
Techniques and Tools (3 Marks):
 Use of:
o Machine learning for predictive attack detection.
o SIEM tools enhanced with big data platforms.
o Visualization tools for identifying attack trends (e.g., Kibana).
 Data mining and event correlation models.
Applications and Benefits (2 Marks):
 Improved incident response time.
 Enhanced security posture with proactive defense.
 Better compliance reporting and audit trails.
Conclusion (2 Marks):
 Big data enables deep, scalable, and real-time analysis of firewall traffic.
 Helps in identifying sophisticated threats across network boundaries.
Unit-2

Question 1: (Remembering) What is NoSQL, and how does it differ from traditional
relational databases?

Answer: NoSQL, short for "Not Only SQL," is a database management system designed to
handle large volumes of unstructured and semi-structured data efficiently. Unlike traditional
relational databases, NoSQL databases do not rely on a fixed schema and offer greater flexibility
in data modeling.

Table: Comparison between NoSQL and Traditional Relational Databases

Aspect | NoSQL Databases | Traditional Relational Databases
Data Model | Flexible and schemaless | Rigid and adheres to a predefined schema
Scaling | Horizontally scalable | Vertically scalable
Query Language | Various query languages | SQL (Structured Query Language)
ACID Transactions | May not support ACID properties | ACID-compliant
Data Types | Supports diverse data types | Limited data types
Relationships | Less emphasis on relationships | Emphasizes relationships with JOINs

NoSQL databases offer advantages in handling unstructured and rapidly evolving data, making
them suitable for modern big data applications and use cases where flexibility and scalability are
crucial.

Question 2: (Understanding) Compare the key-value and document data models in NoSQL
databases.

Answer: Key-value and document data models are two popular data models used in NoSQL
databases, each offering unique benefits for different use cases.

Table: Comparison between Key-Value and Document Data Models

Aspect | Key-Value Data Model | Document Data Model
Data Structure | Simple key-value pairs | JSON-like documents
Flexibility | Limited to simple data structures | Supports nested and complex data structures
Schema | Schemaless | Schemaless
Querying | Limited query support | Rich querying capabilities with indexes
Use Cases | Caching, session management | Content management systems, e-commerce

Key-value data models excel in high-performance scenarios, like caching and session
management, due to their simplicity and efficient data retrieval. On the other hand, the document
data model's flexibility makes it well-suited for complex data structures and use cases where data
evolves frequently, like content management systems and e-commerce platforms.

Question 3: (Applying) Explain the concept of graph databases and their applications in
real-world scenarios.

Answer: Graph databases are NoSQL databases that use graph structures to represent and store
data, making them ideal for scenarios where relationships between data points are crucial.

Table: Applications of Graph Databases in Real-World Scenarios

Scenario | Graph Database Application
Social Networks | Modeling and analyzing connections between users and their relationships.
Recommendation Engines | Generating personalized recommendations based on user interactions.
Fraud Detection | Identifying complex patterns and fraud rings by analyzing networks of transactions.
Knowledge Graphs | Building structured representations of knowledge and semantic relationships.

Graph databases excel in scenarios where the analysis of relationships between data points is
vital. Their ability to traverse complex networks efficiently makes them powerful tools for
various real-world applications.

Question 4: (Analyzing) Evaluate the concept of materialized views and their role in
improving database performance.

Answer: Materialized views are precomputed views of data stored physically in the database,
providing improved query performance by avoiding expensive computations during runtime.

Table: Advantages of Materialized Views for Database Performance

Advantages of Materialized Views | Description
Query Performance | Materialized views store precomputed results, reducing query execution time and improving performance.
Data Redundancy and Optimization | Data redundancy in materialized views enhances read performance, optimizing frequent queries.
Complex Aggregations and Joins | Materialized views simplify complex aggregations and joins, reducing the complexity of queries.
Scalability | Materialized views enhance scalability by reducing the load on the main database during query execution.

Materialized views are particularly beneficial for large and complex databases, where frequent
query optimization is essential to ensure efficient data retrieval and processing.

Question 5: (Evaluating) Assess the distribution models used in NoSQL databases and their
impact on data availability and fault tolerance.

Answer: Distribution models in NoSQL databases dictate how data is distributed and replicated
across nodes in a distributed system, directly affecting data availability and fault tolerance.

Table: Distribution Models in NoSQL Databases and Their Impact

Distribution Model | Description | Impact on Data Availability and Fault Tolerance
Sharding | Data is partitioned into shards distributed across multiple nodes. | Enhances data availability by reducing single points of failure; however, data loss risk exists if a shard becomes unavailable.
Replication | Data is replicated across multiple nodes for redundancy. | Improves fault tolerance by ensuring data availability even if some nodes fail; however, increased storage requirements can be a concern.
Consistent Hashing | A hash function is used to map data to nodes in a consistent manner. | Promotes load balancing and fault tolerance as data distribution is evenly spread across nodes.

Choosing the appropriate distribution model depends on the specific use case, data volume, and
performance requirements. Properly implemented distribution models play a critical role in
ensuring data availability and fault tolerance in NoSQL databases.

Question 6: (Creating) Design a master-slave replication setup in a NoSQL database for data redundancy and fault tolerance.

Answer: A master-slave replication setup in a NoSQL database involves one primary node
(master) and one or more secondary nodes (slaves) that replicate data from the master.

Table: Design of Master-Slave Replication Setup

Component | Description
Master Node | Handles write operations and serves as the primary source of data.
Slave Nodes | Replicate data from the master node to ensure data redundancy.
Data Synchronization | Synchronization mechanisms ensure that data is consistent across all nodes.
Load Balancing | Distribute read queries among slave nodes, improving read performance.
Failover Mechanism | Automatic failover to a slave node in case the master node fails.

This master-slave replication setup ensures data redundancy, improved read performance, and
fault tolerance by enabling automatic failover to maintain data availability even if the master
node goes offline.

Question 7: (Creating) Develop a comprehensive data consistency strategy for a NoSQL database like Cassandra.

Answer: Maintaining data consistency in a distributed NoSQL database like Cassandra is crucial
for data integrity. Let's outline a comprehensive data consistency strategy:
Table: Data Consistency Strategy for Cassandra

Component | Description
Consistency Level | Define the level of consistency for read and write operations, balancing performance and data integrity.
Quorum-based Writes | Use quorum-based writes to ensure data is written to a majority of replicas, ensuring consistency.
Read Repair | Enable read repair to resolve any inconsistencies during read operations.
Hinted Handoff | Enable hinted handoff to ensure data consistency when a node is temporarily unavailable.
Anti-Entropy and Compaction | Regularly run anti-entropy repair and compaction to reconcile data across replicas.

By following this data consistency strategy, the NoSQL database can maintain data integrity and
deliver reliable query results even in a distributed environment.
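As an illustration of the consistency-level component above, the consistency level can be set per statement in client code. The following is a minimal, hedged sketch assuming the DataStax Java driver 4.x and an illustrative student table (names and values are placeholders, not part of the original answer):

import com.datastax.oss.driver.api.core.CqlSession;
import com.datastax.oss.driver.api.core.DefaultConsistencyLevel;
import com.datastax.oss.driver.api.core.cql.SimpleStatement;

public class QuorumWriteSketch {
    // Writes a row with QUORUM consistency: a majority of replicas must acknowledge the write.
    public static void writeStudent(CqlSession session) {
        SimpleStatement stmt = SimpleStatement.builder(
                "INSERT INTO student (id, name, dept, marks) VALUES (uuid(), 'John', 'CS', 85)")
            .setConsistencyLevel(DefaultConsistencyLevel.QUORUM)   // illustrative consistency level
            .build();
        session.execute(stmt);
    }
}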
Question 8: (Evaluating) Assess the role of Cassandra clients in interacting with a
Cassandra database and their advantages.

Answer: Cassandra clients are software libraries that enable applications to interact with the
Cassandra database, executing read and write operations.

Table: Advantages of Cassandra Clients

Advantages of Cassandra Clients | Description
Language Support | Cassandra clients offer support for multiple programming languages, providing flexibility for developers.
Data Model Abstraction | Clients abstract Cassandra's data model, simplifying data access and management for applications.
Load Balancing and Failover Management | Clients handle load balancing and failover to ensure optimal performance and high availability.
Asynchronous Operations | Cassandra clients support asynchronous operations, enabling non-blocking communication with the database.
Query Optimization | Clients optimize queries, reducing latency and improving overall application performance.

Cassandra clients serve as crucial middleware between applications and the database, offering
various advantages that enhance the development and performance of applications interacting
with Cassandra.

9. Explain the following with examples. (MAY/JUNE 2025)


i) Cassandra Data Model
Key Points (with Example):
 Core Concepts:
o Keyspace: Similar to a database in RDBMS.
o Column Family: Like a table.
o Row: Identified by a unique row key.
o Columns: Each row contains multiple columns (can vary between rows).
 Data Model Type: Wide-column store (schema-less, flexible).
 Primary Key: Combines partition key and clustering columns for data distribution.
 Example:
CREATE TABLE student (
    id UUID PRIMARY KEY,
    name TEXT,
    dept TEXT,
    marks INT
);
ii) Cassandra Clients
Key Points:
 Cassandra Clients help applications communicate with the database.
 Examples of Clients:
o CQLSH – Command-line interface
o DataStax Drivers – Java, Python, Node.js, etc.
o Thrift API – Low-level access
 Advantages:
o Multi-language support
o Efficient handling of distributed queries
o Supports async and non-blocking operations
 Use Case Example:
o Python app using cassandra-driver to connect and query a student records DB
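For comparison, the same use case with the DataStax Java driver might look roughly like the sketch below (a hedged illustration assuming driver 4.x and the student table from part i; the contact point, datacenter, and keyspace names are placeholders):

import com.datastax.oss.driver.api.core.CqlSession;
import com.datastax.oss.driver.api.core.cql.ResultSet;
import com.datastax.oss.driver.api.core.cql.Row;
import java.net.InetSocketAddress;

public class StudentQuerySketch {
    public static void main(String[] args) {
        // Build a session against an illustrative local node and keyspace.
        try (CqlSession session = CqlSession.builder()
                .addContactPoint(new InetSocketAddress("127.0.0.1", 9042))
                .withLocalDatacenter("datacenter1")   // placeholder datacenter name
                .withKeyspace("school")               // placeholder keyspace
                .build()) {
            ResultSet rs = session.execute("SELECT name, dept FROM student");
            for (Row row : rs) {
                System.out.println(row.getString("name") + " - " + row.getString("dept"));
            }
        }
    }
}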
10. With necessary examples, explain aggregate data model in NoSQL. (MAY/JUNE 2025)
Key Points:
 Definition:
o Aggregate model stores related data together as a single document or object.
o Often used in document-based databases (e.g., MongoDB), key-value stores, and
column-family stores.
 Types of Aggregate Models:
o Document Store (MongoDB)
o Key-Value Store (Redis, Riak)
o Wide-Column Store (Cassandra)
 Advantages:
o Fast access – all required data in one object
o Simplified transaction management
o Denormalized data – optimized for read
 Example (MongoDB):
{
  "student_id": "S001",
  "name": "John",
  "courses": [
    {"course_id": "CS101", "grade": "A"},
    {"course_id": "CS102", "grade": "B"}
  ]
}
 Use Case: Shopping cart, blog post with comments, order system

11. Explain the concept of graph databases and their applications in real-world scenarios.
(NOV/DEC 2024)
Key Points:
 Concept:
o A graph database uses nodes, edges, and properties to represent and store data.
o Nodes = entities, Edges = relationships
o Supports complex, many-to-many relationships efficiently
o Query language: Cypher (Neo4j)
 Graph Model Advantage:
o Highly interconnected data
o Fast relationship traversal (better than joins)
 Example (Neo4j):
(Person)-[:FRIENDS_WITH]->(Person)
(Person)-[:LIKES]->(Product)
 Applications:
o Social Networks (Facebook, LinkedIn)
o Recommendation Engines (e.g., Netflix, Amazon)
o Fraud Detection (analyzing transaction networks)
o Knowledge Graphs (Google Knowledge Panel)
o Supply Chain Management
 Benefits:
o Intuitive representation of relationships
o Real-time pathfinding and network analysis

Unit-3

Question 1: (Remembering) What are MapReduce workflows, and how do they enable
distributed data processing?

Answer: MapReduce workflows are programming models used for processing large datasets in a
distributed computing environment. They consist of two main steps: Map and Reduce. The Map
step processes input data and generates key-value pairs as intermediate outputs. The Reduce step
then aggregates and summarizes the intermediate results based on the common keys.

Table: MapReduce Workflow Steps

Step | Description
Map | In this step, input data is divided into smaller splits, and each split is processed independently by individual Mapper tasks.
Shuffle and Sort | The intermediate key-value pairs generated by the Mappers are sorted and grouped based on the keys before being passed to the Reducer tasks.
Reduce | The Reducer tasks aggregate and process the grouped data, producing the final output.

MapReduce workflows enable distributed data processing by leveraging the parallel processing
capabilities of a large cluster of nodes, allowing for efficient analysis of massive datasets.
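For illustration, a minimal word-count sketch using the standard Hadoop Java MapReduce API follows (class names and the whitespace tokenization are illustrative choices, not part of the original answer):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Mapper: emits (word, 1) for every token in its input split.
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        for (String token : value.toString().split("\\s+")) {
            word.set(token);
            context.write(word, ONE);   // intermediate key-value pair
        }
    }
}

// Reducer: sums the counts for each word after the shuffle and sort step.
class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        context.write(key, new IntWritable(sum));   // final output record
    }
}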

Question 2: (Understanding) How does MRUnit facilitate unit testing in MapReduce applications?

Answer: MRUnit is a testing framework that allows developers to perform unit tests on
MapReduce applications without the need for a full Hadoop cluster. It provides an environment
to simulate MapReduce job execution locally.

Table: Advantages of MRUnit for Unit Testing


Advantages of MRUnit | Description
Fast and Local Testing | MRUnit enables developers to test their code locally and quickly, without the overhead of setting up a Hadoop cluster.
Isolated Testing Environment | MRUnit creates an isolated testing environment, ensuring that test results are consistent and reproducible.
Easy Validation of Output | Developers can validate the output of Mapper and Reducer tasks easily, allowing for quick bug identification.
Integration with JUnit | MRUnit integrates seamlessly with JUnit, making it easy to incorporate unit testing into the development workflow.

MRUnit empowers developers to catch errors early in the development process, ensuring the
correctness and robustness of their MapReduce applications.
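A minimal sketch of an MRUnit test, assuming the MRUnit 1.x MapDriver API, JUnit 4, and the illustrative WordCountMapper sketched under Question 1 of this unit:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mrunit.mapreduce.MapDriver;
import org.junit.Before;
import org.junit.Test;

public class WordCountMapperTest {
    private MapDriver<LongWritable, Text, Text, IntWritable> mapDriver;

    @Before
    public void setUp() {
        // Wrap the Mapper under test in a MapDriver; no Hadoop cluster is needed.
        mapDriver = MapDriver.newMapDriver(new WordCountMapper());
    }

    @Test
    public void emitsOneCountPerWord() throws Exception {
        mapDriver.withInput(new LongWritable(0), new Text("hadoop streaming hadoop"))
                 .withOutput(new Text("hadoop"), new IntWritable(1))
                 .withOutput(new Text("streaming"), new IntWritable(1))
                 .withOutput(new Text("hadoop"), new IntWritable(1))
                 .runTest();   // compares actual Mapper output with the expected output
    }
}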

Question 3: (Applying) Describe the anatomy of a MapReduce job run in a Hadoop cluster.

Answer: The execution of a MapReduce job in a Hadoop cluster involves several stages and
components that work together to process data efficiently.

Table: Anatomy of a MapReduce Job Run in Hadoop Cluster


Stage | Description
Job Submission | The user submits the MapReduce job to the Hadoop cluster using the Hadoop JobClient or the YARN ResourceManager.
Job Initialization | The JobTracker (classic MapReduce) or ResourceManager (YARN) initializes the job, allocating resources and scheduling tasks.
Map Phase | Input data is divided into splits, and Mapper tasks process these splits independently. Intermediate key-value pairs are generated as outputs.
Shuffle and Sort Phase | Intermediate outputs from the Mappers are sorted and grouped based on their keys before being passed to the Reducer tasks.
Reduce Phase | Reducer tasks process the sorted and grouped data, aggregating and producing the final output.
Job Completion | Once all tasks are completed, the JobTracker or ResourceManager marks the job as successful or failed, and the output is stored in HDFS or the specified output location.

Understanding the various stages and components involved in a MapReduce job run is essential
for optimizing performance and troubleshooting any issues that may arise during job execution.

Question 4: (Analyzing) Compare the classic MapReduce and YARN architectures in Hadoop.

Answer: Classic MapReduce and YARN (Yet Another Resource Negotiator) are two different
resource management architectures in Hadoop, serving distinct purposes in handling data
processing tasks.

Table: Comparison between Classic MapReduce and YARN Architectures

Aspect | Classic MapReduce | YARN
Resource Management | Centralized in the JobTracker | Global ResourceManager with per-node NodeManagers
Job Execution | Single JobTracker coordinates all jobs and tasks | A per-application ApplicationMaster manages each job
Scalability | Limited scalability for large clusters | Highly scalable and supports thousands of nodes
Fault Tolerance | Single point of failure (JobTracker) | Distributed and fault-tolerant architecture
Support for Other Processing Models | Limited support for other processing models | Extensible and supports multiple processing models

YARN addresses the limitations of the classic MapReduce architecture by introducing a distributed resource management model, supporting various data processing frameworks, and providing improved scalability and fault tolerance.

Question 5: (Evaluating) Assess the impact of failures in classic MapReduce and YARN on
job execution and data processing.

Answer: Failures in classic MapReduce and YARN can have significant implications for job
execution and data processing tasks.

Table: Impact of Failures in Classic MapReduce and YARN

Impact of Failures | Description
Classic MapReduce | A failure in the JobTracker can result in the entire job being halted, leading to significant delays and possible data loss.
YARN | YARN's distributed ResourceManager architecture ensures that job failures are isolated to specific NodeManagers, allowing other tasks to continue processing. YARN's fault tolerance enables job recovery and minimizes the impact on data processing.

Failures in classic MapReduce can result in job failures and potential data loss, while YARN's
distributed architecture provides better fault tolerance and job recovery capabilities, reducing the
impact of failures on data processing tasks.

Question 6: (Creating) Design a job scheduling strategy for a Hadoop cluster to optimize
resource utilization.

Answer: A well-designed job scheduling strategy in a Hadoop cluster can enhance resource
utilization and overall cluster efficiency.

Table: Components of Job Scheduling Strategy

Component | Description
Fair Scheduler | Implement the Fair Scheduler, which allocates resources to jobs based on their fairness, ensuring equal opportunities for all jobs to execute.
Capacity Scheduler | Use the Capacity Scheduler to guarantee resource allocations for specific user groups or departments, preventing resource hogging.
Queue Prioritization | Prioritize queues based on job importance or urgency, ensuring critical jobs get priority access to resources.
Job Profiling and Resource Estimation | Analyze job profiles and resource requirements to allocate appropriate resources for each job, preventing resource underutilization.

A well-thought-out job scheduling strategy helps maintain a balanced workload distribution, prevents resource starvation, and optimizes resource utilization in a Hadoop cluster.

Question 7: (Creating) Develop an algorithm depicting the steps involved in the shuffle and
sort phase of a MapReduce job.

Answer:

The shuffle and sort phase in a MapReduce job involves the movement of intermediate key-value
pairs from Mappers to Reducers. The shuffle and sort phase includes the following steps:

1. Mappers generate intermediate key-value pairs from input data.
2. The data is partitioned and grouped by keys.
3. Data with the same key is shuffled to the appropriate Reducer node.
4. Reducers sort the data based on keys.
5. Reducers process the sorted data and produce the final output.

The shuffle and sort phase is a critical step in the MapReduce process, ensuring that relevant data
is grouped, sorted, and sent to the appropriate Reducers for further processing.
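The routing decision in step 3 is made by the job's partitioner. The sketch below is a simplified illustration of the default hash-partitioning behaviour (not the exact Hadoop source):

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Simplified hash partitioner: records with the same key always land on the same Reducer.
public class WordPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numReduceTasks) {
        // Mask the sign bit so the result is non-negative, then take the modulus.
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}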

Question 8: (Evaluating) Evaluate the significance of input formats and output formats in
MapReduce jobs.

Answer: Input formats and output formats play a crucial role in defining how data is read from
input sources and written to output destinations in MapReduce jobs.

Table: Significance of Input and Output Formats


Significance of Input Formats | Description
Data Readability | Input formats determine how data is read from the input source (e.g., HDFS, databases). Properly chosen input formats ensure data readability and integrity during processing.
Data Splitting and Distribution | Input formats enable data splitting into manageable splits, which are processed in parallel by Mappers, leading to efficient data distribution and processing.
Customization | Hadoop allows developers to create custom input formats, allowing them to handle various data types and structures tailored to their specific use cases.

Significance of Output Formats | Description
Data Write Flexibility | Output formats determine how the results are written to the output destination (e.g., HDFS, databases). Different formats cater to different use cases and applications.
Data Serialization | Output formats ensure proper serialization of data, ensuring compatibility and easy integration with other systems and tools.
Output Compression | Output formats offer compression options, reducing storage requirements and enhancing data transfer efficiency.
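As an illustration, input and output formats are configured on the Job object in a MapReduce driver. The sketch below reuses the illustrative word-count classes from Question 1; paths and the job name are placeholders:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCountMapper.class);        // illustrative mapper from Question 1
        job.setReducerClass(WordCountReducer.class);      // illustrative reducer from Question 1
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        job.setInputFormatClass(TextInputFormat.class);    // how input splits are read
        job.setOutputFormatClass(TextOutputFormat.class);  // how results are written
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}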

9. Discuss how MapReduce abstracts parallel processing tasks across a cluster of computers and facilitates the efficient processing of large-scale data. (NOV/DEC 2024)

Key Points:
 MapReduce Overview: Programming model for processing large data sets in parallel.
 Abstraction:
o Hides details of parallelization, fault tolerance, data distribution, and load
balancing.
 Working:
o Map phase: Processes input data and emits key-value pairs.
o Reduce phase: Aggregates results from Map phase.
 Efficiency Features:
o Automatic handling of data partitioning and task scheduling.
o Fault tolerance through re-execution of failed tasks.
 Cluster-based Processing:
o Uses HDFS for storage, tasks distributed across nodes.
 Example: Word count or log analysis using MapReduce.

10. Describe shuffling and sorting technique in MapReduce with example. (NOV/DEC
2024)
Key Points:
 Shuffling:
o Process of transferring Map output to the appropriate Reducers.
o Ensures all values for the same key go to the same reducer.
 Sorting:
o Map output is sorted by keys before being fed to the reducer.
o Helps in grouping data efficiently.
 Importance: Required for grouping phase in reducers.
 Example:
o Word Count: All occurrences of "Hadoop" grouped and sent to one reducer.
 Diagram:
o Show flow: Map → Shuffle → Sort → Reduce

11. What is unit testing? Explain the steps followed in unit test with MRUnit. (APR/MAY
2025)

Key Points:
 Unit Testing:
o Testing individual components (units) of software.
o In MapReduce, test logic in Mapper/Reducer classes.
 MRUnit:
o A Java-based testing framework for MapReduce unit testing.
o Allows testing without running on full Hadoop cluster.
 Steps:
1. Add MRUnit JAR to project.
2. Define test data (input/output).
3. Instantiate MapDriver, ReduceDriver, or MapReduceDriver.
4. Set input/output values.
5. Call .runTest() and verify output.
 Example: Unit test for Mapper that emits (word, 1)

12. Explain various job scheduling algorithms followed in YARN job scheduling with clear
examples. (APR/MAY 2025)

Key Points:
 YARN Schedulers:
1. FIFO Scheduler:
 Jobs executed in order of submission.
 Simple but not fair in resource sharing.
 Use case: Small, single-user environments.
2. Capacity Scheduler:
 Divides cluster into queues with defined capacities.
 Supports multi-tenancy.
 Use case: Organizations with resource partitions per team.
3. Fair Scheduler:
 Resources distributed equally among jobs over time.
 Jobs get fair share, avoids starvation.
 Use case: Shared clusters with interactive and batch jobs.
 Examples: Multiple users running jobs; how Fair scheduler assigns equal resources over
time.
13. Explain the architecture of YARN and discuss how it manages resources and job
scheduling in a Hadoop cluster. (NOV/DEC 2024)

Key Points:
 YARN (Yet Another Resource Negotiator):
o Decouples resource management and job scheduling from MapReduce.
 Architecture Components:
1. ResourceManager (RM):
 Global scheduler and arbitrator.
2. NodeManager (NM):
 Manages resources on individual nodes.
3. ApplicationMaster (AM):
 Handles job-specific logic.
4. Container:
 Execution environment with allocated resources.
 Job Scheduling:
o RM allocates containers to AMs.
o AM requests further containers to run tasks.
 Diagram: Show ResourceManager, NodeManagers, ApplicationMaster, Containers

14. i) Explain the architecture of Apache Hadoop YARN and its significance in the Hadoop
ecosystem. (NOV/DEC 2024)

Key Points:
 Repeat of Q13 with emphasis on ecosystem benefits:
o Supports multiple frameworks (MapReduce, Spark, Tez, etc.)
o Better scalability, resource utilization, and multi-tenancy
14. ii) Explain how a MapReduce job runs on YARN. (NOV/DEC 2024)

Key Points:
 Job Execution Flow:
1. Job submission by client to ResourceManager.
2. ResourceManager starts ApplicationMaster in a container.
3. ApplicationMaster negotiates more containers for Map/Reduce tasks.
4. Tasks run inside containers managed by NodeManagers.
5. Status updates, job completion reported to client.
 Key Concepts:
o ApplicationMaster lifecycle
o MapTask and ReduceTask handled as independent containers
o Fault tolerance via retry and rescheduling
Unit-4
Question 1: (Remembering) What is Hadoop Streaming, and how does it enable data
processing with non-Java programs in Hadoop?

Answer: Hadoop Streaming is a utility in Hadoop that enables data processing with non-Java
programs. It allows developers to use any programming language that can read from standard
input and write to standard output as Mapper and Reducer functions in MapReduce jobs.

Table: Advantages of Hadoop Streaming

Advantages of Hadoop Streaming | Description
Language Flexibility | Developers can write MapReduce jobs using their preferred programming language, allowing for greater flexibility.
Code Reusability | Existing scripts and programs can be easily integrated into Hadoop jobs, promoting code reusability.
Community Contributions | Hadoop Streaming encourages contributions from developers proficient in various programming languages, enriching the Hadoop ecosystem.

Hadoop Streaming is particularly useful when specialized processing tasks require languages
other than Java, making it a versatile tool for data processing in Hadoop.

Question 2: (Understanding) How does Hadoop Pipes facilitate the integration of C++
programs with Hadoop?
Answer: Hadoop Pipes is a C++ API that enables the integration of C++ programs with Hadoop.
It allows developers to create Mappers and Reducers using C++ programming language,
providing an alternative to Java for data processing in Hadoop.

Table: Advantages of Hadoop Pipes

Advantages of Hadoop Pipes | Description
C++ Integration | Hadoop Pipes allows C++ developers to seamlessly integrate their programs with Hadoop MapReduce.
High Performance | C++ programs compiled natively for the underlying platform offer superior performance compared to interpreted languages like Java.
Existing C++ Code Reuse | Organizations with existing C++ codebases can reuse their libraries and algorithms in Hadoop, saving development time and effort.

Hadoop Pipes is an excellent choice for organizations with C++ expertise, allowing them to
leverage their existing codebase for data processing in Hadoop.

Question 3: (Applying) Describe the design of the Hadoop Distributed File System (HDFS)
and its key features.

Answer: The Hadoop Distributed File System (HDFS) is the storage layer of the Hadoop
ecosystem, designed to handle massive datasets distributed across a cluster of commodity
hardware.
Table: Key Features of Hadoop Distributed File System (HDFS)

Feature | Description
Distributed Storage | HDFS distributes data across multiple nodes, providing fault tolerance and scalability.
Data Replication | HDFS replicates data blocks across nodes to ensure data availability even if some nodes fail.
Block Storage | Data in HDFS is stored in fixed-size blocks, typically 128 MB or 256 MB in size.
Write-Once-Read-Many (WORM) Model | Data in HDFS is typically written once and read multiple times, making it suitable for batch processing.
Data Integrity | HDFS uses checksums to ensure data integrity during data read and write operations.

The design of HDFS enables efficient and reliable storage and retrieval of large-scale data,
making it the backbone of many big data applications.
Question 4: (Analyzing) Compare Hadoop I/O methods - Local I/O and HDFS I/O, and
their impact on data processing in Hadoop.

Answer: Hadoop supports two primary I/O methods: Local I/O, which deals with data on the
local file system, and HDFS I/O, which involves reading and writing data to and from the
Hadoop Distributed File System (HDFS).

Table: Comparison between Hadoop Local I/O and HDFS I/O

Aspect | Local I/O | HDFS I/O
Data Storage and Replication | Local I/O stores data on a single node and lacks data replication for fault tolerance. | HDFS I/O stores data across multiple nodes with replication for fault tolerance.
Scalability and Parallel Processing | Local I/O does not support horizontal scaling and parallel processing. | HDFS I/O supports scaling out across a cluster and parallel processing, optimizing data processing.
Data Movement and Data Access | Local I/O moves data to and from a single node, potentially leading to data movement bottlenecks. | HDFS I/O accesses data locally on each node, reducing data movement overhead.
Fault Tolerance | Local I/O lacks inherent fault tolerance features. | HDFS I/O provides built-in data replication for fault tolerance and data availability.

HDFS I/O outperforms Local I/O in Hadoop environments by providing distributed storage, fault
tolerance, and scalability, enabling efficient data processing in large-scale distributed systems.

Question 5: (Evaluating) Assess the significance of data integrity in Hadoop and its impact
on data quality and reliability.

Answer: Data integrity is a critical aspect of Hadoop data processing, ensuring data quality and
reliability throughout the data lifecycle.

Table: Impact of Data Integrity in Hadoop

Impact of Data Integrity | Description
Data Accuracy and Quality | Ensuring data integrity guarantees the accuracy and reliability of analytical results derived from Hadoop data processing.
Preventing Data Corruption | Data integrity mechanisms like checksums and replication prevent data corruption during storage and transmission.
Trust in Data-Driven Decisions | Maintaining data integrity instills confidence in the data-driven decision-making process, promoting its adoption across the organization.
Compliance and Data Governance | Data integrity is essential for maintaining compliance with regulatory requirements and data governance policies.

Data integrity is fundamental in Hadoop to preserve the trustworthiness of data, prevent data
corruption, and foster confidence in the analytical insights derived from big data processing.

Question 6: (Creating) Design a data compression strategy for Hadoop to optimize storage
and processing efficiency.

Answer: A data compression strategy in Hadoop involves compressing input data for storage and
decompressing it during processing, optimizing storage space and processing efficiency.

Table: Components of Data Compression Strategy

Component | Description
Compression Algorithm | Choose an appropriate compression algorithm (e.g., Gzip, Snappy) based on data type and compression ratio requirements.
Input Data Compression | Compress input data before storing it in HDFS to reduce storage space requirements.
Output Data Compression | Compress output data generated by MapReduce jobs to minimize data transfer and storage costs.
Decompression Strategy | Implement an efficient decompression strategy to ensure timely data processing with reduced overhead.

A well-designed data compression strategy in Hadoop optimizes storage utilization and reduces
data transfer costs, enhancing overall performance and cost-efficiency.
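A minimal sketch of switching on intermediate and final output compression in a MapReduce driver, assuming the Snappy codec is available on the cluster and using the standard mapreduce.* property names (the job name is a placeholder):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.SnappyCodec;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CompressionConfigSketch {
    public static Job configure() throws Exception {
        Configuration conf = new Configuration();
        // Compress intermediate map output to cut shuffle traffic between nodes.
        conf.setBoolean("mapreduce.map.output.compress", true);
        conf.setClass("mapreduce.map.output.compress.codec",
                      SnappyCodec.class, CompressionCodec.class);

        Job job = Job.getInstance(conf, "compressed job");
        // Compress the final job output written to HDFS.
        FileOutputFormat.setCompressOutput(job, true);
        FileOutputFormat.setOutputCompressorClass(job, SnappyCodec.class);
        return job;
    }
}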

Question 7: (Creating) Explain the concept of Avro serialization and its advantages in
Hadoop.

Answer: Avro is a data serialization system that allows for efficient and compact data storage
and exchange between programs in Hadoop.

Table: Advantages of Avro Serialization

Advantages of Avro Serialization | Description
Schema Evolution | Avro supports schema evolution, enabling changes in data structure without breaking compatibility.
Compact Binary Encoding | Avro uses a compact binary encoding format, reducing the data size and improving data transfer performance.
Language Independence | Avro allows data exchange between programs written in different languages, promoting interoperability in a multi-language Hadoop ecosystem.

Avro's schema evolution capabilities, compact binary encoding, and language independence
make it an ideal choice for data serialization in Hadoop, facilitating efficient data processing and
data interchange between applications.
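A minimal sketch of Avro serialization and deserialization with the generic API (the schema, field names, and values are illustrative):

import java.io.ByteArrayOutputStream;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryDecoder;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.DatumReader;
import org.apache.avro.io.DatumWriter;
import org.apache.avro.io.DecoderFactory;
import org.apache.avro.io.EncoderFactory;

public class AvroSketch {
    public static void main(String[] args) throws Exception {
        // Parse an illustrative schema describing a student record.
        Schema schema = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"Student\",\"fields\":["
          + "{\"name\":\"id\",\"type\":\"int\"},{\"name\":\"name\",\"type\":\"string\"}]}");

        GenericRecord record = new GenericData.Record(schema);
        record.put("id", 1);
        record.put("name", "John");

        // Serialize the record to compact binary bytes.
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        DatumWriter<GenericRecord> writer = new GenericDatumWriter<>(schema);
        BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
        writer.write(record, encoder);
        encoder.flush();

        // Deserialize the bytes back into a record using the same schema.
        DatumReader<GenericRecord> reader = new GenericDatumReader<>(schema);
        BinaryDecoder decoder = DecoderFactory.get().binaryDecoder(out.toByteArray(), null);
        GenericRecord decoded = reader.read(null, decoder);
        System.out.println(decoded.get("name"));   // prints John
    }
}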

Question 8: (Evaluating) Evaluate the integration of Cassandra with Hadoop and its
significance in big data analytics.

Answer: The integration of Cassandra with Hadoop combines the strengths of both systems,
enabling efficient big data analytics and real-time data processing.

Table: Significance of Cassandra-Hadoop Integration

Significance of Integration | Description
Data Synchronization | Cassandra data can be efficiently synchronized with Hadoop, enabling the seamless analysis of real-time and historical data.
Scalability and Fault Tolerance | Combining the scalability of Cassandra with the fault tolerance of Hadoop ensures robustness and high availability in data processing.
Analytical Insights | Integrating Cassandra with Hadoop allows for deep analytical insights from large datasets and real-time data streams.
Real-time Data Processing | The combination of real-time data processing in Cassandra and batch processing in Hadoop creates a powerful big data analytics ecosystem.

The integration of Cassandra with Hadoop empowers organizations to perform real-time analytics on vast amounts of data, leveraging both systems' strengths for informed decision-making and advanced analytics.

9. Explain the role of Java interfaces in HDFS. Discuss how these interfaces are used to
interact with HDFS and describe their key functions and methods. (APR/MAY 2025)
Key Points:
 Role of Java Interfaces in HDFS:
o Java interfaces provide APIs for clients and developers to interact with HDFS
programmatically.
o Enable file operations, directory manipulation, and metadata handling.
 Key Interfaces and Classes:
o FileSystem (abstract class): Base class for all Hadoop file systems.
o DistributedFileSystem: Implementation of HDFS in Java.
o FSDataInputStream, FSDataOutputStream: For reading and writing data.
o Path: Represents file/directory paths.
 Common Methods:
fs.open(new Path("/file.txt"));
fs.create(new Path("/file.txt"));
fs.delete(new Path("/file.txt"), true);
fs.listStatus(new Path("/"));
 Usage Example:
o Java program using Hadoop API to read and write a file to HDFS.
o Explain Configuration conf = new Configuration(); FileSystem fs =
FileSystem.get(conf);
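A self-contained sketch of the usage example above, assuming the Hadoop client libraries are on the classpath (the path and file contents are illustrative):

import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadWriteSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // picks up core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);

        // Write a small file to HDFS (overwrite if it already exists).
        Path file = new Path("/user/hadoop/demo/file.txt");
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.writeBytes("hello hdfs\n");
        }

        // Read the file back line by line.
        try (BufferedReader in = new BufferedReader(new InputStreamReader(fs.open(file)))) {
            System.out.println(in.readLine());
        }

        fs.delete(file, false);   // clean up (non-recursive delete)
        fs.close();
    }
}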

10. Discuss the advantage of using Apache Avro over other Hadoop ecosystem tools.
Highlight Avro’s compact binary encoding, schema evolution support and integration with
Hadoop ecosystem tools. (NOV/DEC 2024)
Key Points:
 What is Avro:
o A row-oriented remote procedure call and data serialization framework.
o Designed for efficient serialization, particularly in Hadoop environments.
 Advantages:
o Compact binary encoding:
 Smaller file size and faster data transmission
 Better suited for large-scale data storage
o Schema Evolution:
 Supports backward and forward compatibility
 Reader and writer schemas can differ as long as rules are followed
o Integration with Hadoop Ecosystem:
 Works with Hive, Pig, MapReduce, and Kafka
 Supports splittable files – better parallel processing in MapReduce
 Use Case Example:
o Storing user logs in Avro format, processing with Pig or Hive

11. Define HDFS. Describe NameNode, DataNode, and Block. Explain HDFS operations in
detail. (NOV/DEC 2024)
Key Points:
 Definition:
o HDFS: Hadoop Distributed File System, designed for fault-tolerant storage of
large files across clusters.
 Components:
o NameNode:
 Master node
 Stores metadata (file names, permissions, block locations)
o DataNode:
 Slave node
 Stores actual data blocks
o Block:
 Unit of storage (default 128 MB)
 Files are split into blocks for distributed storage
 HDFS Operations:
1. Read Operation:
 Client requests file → NameNode gives block locations → Client reads
directly from DataNodes
2. Write Operation:
 Client contacts NameNode → Data blocks are written to multiple
DataNodes (replication)
3. Replication and Fault Tolerance:
 Default replication factor = 3
 Diagram:
o Show NameNode, DataNodes, client, block flow during read/write

12. With necessary examples, explain how serialization and deserialization of data is done.
(APR/MAY 2025)
Key Points:
 Definition:
o Serialization: Converting object into a stream of bytes
o Deserialization: Reconstructing object from byte stream
 In Hadoop:
o Important for data transmission between nodes
o Used in MapReduce, HDFS, Avro, Thrift, Protocol Buffers
 Example using Java Writable (requires org.apache.hadoop.io.* and java.io.* imports; the fields must be initialized before readFields is called):
public class Student implements Writable {
    private IntWritable id = new IntWritable();
    private Text name = new Text();

    // Serialize the fields to the output stream.
    public void write(DataOutput out) throws IOException {
        id.write(out);
        name.write(out);
    }

    // Reconstruct the fields from the input stream.
    public void readFields(DataInput in) throws IOException {
        id.readFields(in);
        name.readFields(in);
    }
}
 Using Avro (Example):
o Define schema → serialize with DatumWriter → deserialize with DatumReader

13. Evaluate the integration of Cassandra with Hadoop and its significance in big data
analytics. (NOV/DEC 2024)
Key Points:
 Why Integrate Cassandra with Hadoop:
o Cassandra: High-availability, NoSQL wide-column store
o Hadoop: Distributed processing
o Integration helps query Cassandra data using MapReduce, Hive, Pig
 Methods of Integration:
o Hadoop-Cassandra Connector: Enables read/write between HDFS and
Cassandra
o MapReduce with CQL (Cassandra Query Language)
o Tools: CassandraBulkRecordWriter, Hive-Cassandra integration
 Significance in Analytics:
o Run batch analytics on Cassandra data without migration
o Use Hadoop for complex computation over data stored in Cassandra
o Supports hybrid workloads – fast OLTP (Cassandra) + batch analytics (Hadoop)
 Example Use Case:
o IoT sensor data stored in Cassandra, analyzed with Hive or Spark on Hadoop

Unit-5

Question 1: (Remembering) What is HBase, and how does its data model differ from
traditional relational databases?
Answer: HBase is a distributed, scalable, and column-oriented NoSQL database built on top of
Hadoop Distributed File System (HDFS). It follows the Bigtable data model, which is different
from traditional relational databases.

Table: Comparison between HBase Data Model and Traditional Relational Database Model

Aspect | HBase Data Model | Traditional Relational Database Model
Structure | HBase stores data in column families, consisting of columns within each family. Data is organized by row keys. | Traditional databases use tables with rows and columns.
Schema | HBase is schemaless, allowing flexibility in adding columns dynamically without affecting existing data. | Traditional databases follow a fixed schema, requiring predefined table structures before data insertion.
Scalability | HBase is designed for horizontal scaling, distributing data across nodes to handle massive datasets. | Traditional databases are vertically scalable, limited by the hardware capacity of a single server.
Read/Write Operations | HBase excels in read-heavy and random read/write operations due to its distributed design. | Traditional databases perform well in structured, relational data processing, but may suffer in random read/write scenarios.

HBase's data model and distributed architecture make it ideal for handling large-scale, real-time,
and high-throughput data scenarios.

Question 2: (Understanding) How do HBase clients interact with the HBase database, and
what are the different types of HBase clients?

Answer: HBase clients interact with the HBase database to perform read and write operations on
data. There are mainly two types of HBase clients: Java-based clients and RESTful clients.

Table: Types of HBase Clients and their Features

HBase Client Type | Description
Java-based Clients | Java clients interact with HBase using the HBase Java API. They provide extensive control over HBase operations and are suitable for Java-centric applications.
RESTful Clients | RESTful clients use HTTP methods to communicate with HBase via the HBase REST API. They offer language independence and are suitable for applications in various programming languages.
HBase clients provide programmatic access to HBase data, allowing applications to read, write,
and manage data in the distributed database.
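A minimal sketch of a Java-based client performing a put and a get (the table, column family, and qualifier names are illustrative; assumes the standard org.apache.hadoop.hbase.client API and an existing table):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseClientSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();   // reads hbase-site.xml
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("reviews"))) {

            // Write one cell: row key "r1", column family "cf", qualifier "text".
            Put put = new Put(Bytes.toBytes("r1"));
            put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("text"), Bytes.toBytes("great product"));
            table.put(put);

            // Read it back by row key.
            Result result = table.get(new Get(Bytes.toBytes("r1")));
            byte[] value = result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("text"));
            System.out.println(Bytes.toString(value));
        }
    }
}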

Question 3: (Applying) Provide examples of typical use cases for HBase and illustrate how
its data model supports them.

Answer: HBase is well-suited for various use cases due to its distributed, column-oriented data
model. Here are some examples:

Table: Examples of HBase Use Cases and Data Model Support

Use Case | Data Model Support
Time Series Data Storage | HBase organizes data by row keys, making it efficient for storing and querying time-series data.
Real-time Analytics | HBase's column-oriented design allows fast retrieval of specific data attributes, supporting real-time analytics on massive datasets.
Internet of Things (IoT) | IoT devices generate large volumes of data, and HBase's horizontal scaling accommodates the storage and processing requirements.
Social Media and Recommendations | HBase's schemaless nature enables flexible data modeling, making it suitable for social media data and personalized recommendations.

HBase's data model provides the necessary flexibility and scalability for a wide range of use
cases, making it a popular choice for big data applications.

Question 4: (Analyzing) Compare praxis.Pig and Grunt in Apache Pig, focusing on their
roles in data processing.

Answer: praxis.Pig and Grunt are two modes of interacting with Apache Pig, a high-level
platform for processing and analyzing large datasets in Hadoop.

Table: Comparison between praxis.Pig and Grunt in Apache Pig

Aspect | praxis.Pig | Grunt
Role | praxis.Pig is a graphical data flow tool that allows users to design Pig workflows visually using a drag-and-drop interface. | Grunt is the command-line shell for Pig, where users write and execute Pig Latin scripts directly.
Ease of Use | praxis.Pig simplifies the development process for users who prefer a graphical interface and have limited knowledge of Pig Latin scripting. | Grunt offers full flexibility and control over Pig operations, making it suitable for experienced users and complex data processing tasks.
Learning Curve | praxis.Pig has a gentle learning curve, allowing beginners to get started with Pig data processing quickly. | Grunt requires familiarity with Pig Latin and command-line interfaces, which may have a steeper learning curve for some users.

Both praxis.Pig and Grunt serve as interfaces for interacting with Apache Pig, catering to users
with different preferences and levels of expertise.

Question 5: (Evaluating) Assess the Pig data model and how it facilitates data processing
using Pig Latin scripts.

Answer: The Pig data model abstracts the complexities of data processing in Apache Pig,
providing a high-level interface for users to write data transformation and analysis using Pig
Latin scripts.

Table: Advantages of the Pig Data Model and Pig Latin Scripts

Advantages of Pig Data Model and Pig Latin | Description
High-Level Abstraction | Pig Latin offers a high-level abstraction, making data processing tasks more accessible to users with limited Hadoop knowledge.
Data Flow Optimization | Pig Latin optimizes data flow automatically, allowing users to focus on data processing logic rather than implementation details.
Support for Complex Data Operations | Pig Latin supports complex data transformations, including joins, aggregations, and filtering, simplifying big data analytics.

The Pig data model and Pig Latin scripts enhance productivity, reduce development time, and
enable users to process large datasets with ease.
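To make the support for complex operations concrete, the short Pig Latin sketch below (file paths, schemas, and relation names are assumptions) joins two datasets, filters the result, and aggregates it in a handful of statements:

orders  = LOAD '/data/orders.csv' USING PigStorage(',') AS (order_id:int, cust_id:int, amount:double);
custs   = LOAD '/data/customers.csv' USING PigStorage(',') AS (cust_id:int, city:chararray);
joined  = JOIN orders BY cust_id, custs BY cust_id;    -- join the two relations on customer id
big     = FILTER joined BY orders::amount > 100.0;     -- keep only large orders
by_city = GROUP big BY custs::city;                    -- group the joined records by city
summary = FOREACH by_city GENERATE group AS city, SUM(big.orders::amount) AS total_amount;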

Question 6: (Creating) Design a Pig Latin script to analyze a dataset for sentiment analysis,
including data loading, processing, and storing results.

Answer: Assume we have a dataset containing user reviews with columns: review_id, user_id,
and review_text. We want to perform sentiment analysis on the review_text and store the results
in HDFS.

Pig Latin Script for Sentiment Analysis

-- Step 1: Load the dataset from HDFS
raw_data = LOAD '/user/hadoop/input/reviews.csv' USING PigStorage(',')
    AS (review_id: int, user_id: int, review_text: chararray);

-- Step 2: Tokenize and clean the review text
tokenized_data = FOREACH raw_data GENERATE review_id, user_id,
    FLATTEN(TOKENIZE(review_text)) AS word;

-- Keep only alphanumeric word tokens (drop punctuation and empty tokens)
cleaned_data = FILTER tokenized_data BY word IS NOT NULL AND word MATCHES '\\w+';

-- Step 3: Perform sentiment analysis (sentiment_score is an assumed user-defined function)
sentiment_data = FOREACH cleaned_data GENERATE review_id, user_id, word,
    sentiment_score(word) AS sentiment;

-- Step 4: Aggregate sentiment scores by review_id and user_id
grouped_data = GROUP sentiment_data BY (review_id, user_id);
average_sentiment = FOREACH grouped_data GENERATE group.review_id AS review_id,
    group.user_id AS user_id, AVG(sentiment_data.sentiment) AS avg_sentiment;

-- Step 5: Store the results in HDFS
STORE average_sentiment INTO '/user/hadoop/output/sentiment_analysis' USING PigStorage(',');

The above Pig Latin script loads the dataset, tokenizes and cleans the text, performs sentiment
analysis, and stores the average sentiment scores per review and user in HDFS.
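Note that sentiment_score is not a built-in Pig function; it is assumed to be provided as a user-defined function (UDF). A sketch of how such a UDF might be registered before Step 3 (the jar path and class name are hypothetical) is:

-- Register the jar containing the assumed UDF and bind it to a short alias
REGISTER '/user/hadoop/udfs/sentiment-udf.jar';
DEFINE sentiment_score com.example.pig.SentimentScore();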

Question 7: (Creating) Develop a Pig Latin script to compute the total sales amount for
each product category from a sales dataset.

Answer: Assume we have a sales dataset with columns: product_id, product_name, category, and
sales_amount. We want to compute the total sales amount for each product category.

Pig Latin Script for Total Sales Amount by Category

-- Step 1: Load the sales dataset from HDFS
sales_data = LOAD '/user/hadoop/input/sales.csv' USING PigStorage(',')
    AS (product_id: int, product_name: chararray, category: chararray, sales_amount: double);

-- Step 2: Group sales data by category
grouped_data = GROUP sales_data BY category;

-- Step 3: Calculate total sales amount for each category
total_sales = FOREACH grouped_data GENERATE group AS category,
    SUM(sales_data.sales_amount) AS total_sales_amount;

-- Step 4: Store the results in HDFS
STORE total_sales INTO '/user/hadoop/output/total_sales_by_category' USING PigStorage(',');

The above Pig Latin script loads the sales dataset, groups the data by category, calculates the
total sales amount for each category, and stores the results in HDFS.
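Assuming the script is saved as total_sales.pig (the file name is an assumption), it could be executed and its output inspected as follows:

$ pig -x mapreduce total_sales.pig
$ hdfs dfs -cat /user/hadoop/output/total_sales_by_category/part-*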

Question 8: (Evaluating) Assess the significance of Hive data types and file formats in data
processing tasks.

Answer: Hive data types and file formats play a crucial role in data processing tasks, providing
flexibility and optimization for various use cases.

Table: Significance of Hive Data Types and File Formats

Data Flexibility: Hive supports a wide range of data types, including primitive types, complex types (arrays, maps, structs), and user-defined types, accommodating diverse data structures.

Query Optimization: Proper selection of file formats (e.g., ORC, Parquet) enhances query performance, reducing data read and processing time.

Data Compression: File formats such as ORC and Parquet offer efficient data compression, minimizing storage requirements and improving query performance.

Schema Evolution: Hive's schema evolution capabilities allow adding or modifying columns without impacting existing data, supporting data model changes over time.

Hive data types and file formats ensure data compatibility, performance optimization, and
schema flexibility, making Hive a powerful tool for big data processing in the Hadoop
ecosystem.
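For example, the file format is chosen at table-creation time; the HiveQL sketch below (table and column names are assumptions) creates an ORC-backed table with compression enabled:

CREATE TABLE sales_orc (
  product_id   INT,
  category     STRING,
  sales_amount DOUBLE
)
STORED AS ORC
TBLPROPERTIES ('orc.compress' = 'SNAPPY');   -- columnar storage with Snappy compression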

9. What is Hive? Explain about data definition and data manipulation in Hive. (Nov/Dec
2024)
Key Points:
 Hive Overview:
o Data warehouse system built on top of Hadoop.
o Allows querying and managing large datasets using HiveQL (SQL-like
language).
 Data Definition Language (DDL):
o Used to define schema and manage tables.
o Examples:
CREATE TABLE students (id INT, name STRING);
ALTER TABLE students ADD COLUMNS (age INT);
DROP TABLE students;
 Data Manipulation Language (DML):
o Insert, update, delete data.
o Examples:
LOAD DATA INPATH '/user/data/students.csv' INTO TABLE students;
INSERT INTO TABLE students VALUES (1, 'John');
 Execution: Hive queries are translated into MapReduce jobs.
10. Develop a Pig Latin script to compute the total sales amount for each product category
from a sales dataset. (Apr/May 2025)
Key Points:
 Assume schema: product_id, category, price, quantity
 Pig Latin Script:
sales = LOAD 'sales_data.csv' USING PigStorage(',')
    AS (product_id:chararray, category:chararray, price:float, quantity:int);
sales_amount = FOREACH sales GENERATE category, (price * quantity) AS total;
grouped = GROUP sales_amount BY category;
total_sales = FOREACH grouped GENERATE group AS category,
    SUM(sales_amount.total) AS total_amount;
DUMP total_sales;
 Explanation: the script loads the sales data, computes a line-item total (price * quantity) for
each record, groups the records by category, and sums the line-item totals per category.
11. Illustrate the usage of ‘filters’, ‘group’, ‘orderBy’, ‘distinct’, and ‘load’ keywords in Pig
scripts. (Apr/May 2025)
Key Points:
 LOAD:
A = LOAD 'students.csv' USING PigStorage(',') AS (id:int, name:chararray, marks:int);
 FILTER:
B = FILTER A BY marks > 50;
 GROUP:
C = GROUP B BY name;
 ORDER BY:
D = ORDER B BY marks DESC;
 DISTINCT:
E = DISTINCT A;
 Explain with output examples showing the results of each stage (an illustrative run on
sample data is sketched below).
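A sketch of what each stage might print, assuming students.csv contains the three hypothetical rows (1,Asha,80), (2,Bala,45), and (3,Chitra,65):

DUMP A;   -- (1,Asha,80) (2,Bala,45) (3,Chitra,65)           : all loaded rows
DUMP B;   -- (1,Asha,80) (3,Chitra,65)                       : rows with marks > 50
DUMP C;   -- (Asha,{(1,Asha,80)}) (Chitra,{(3,Chitra,65)})   : filtered rows grouped by name
DUMP D;   -- (1,Asha,80) (3,Chitra,65)                       : filtered rows ordered by marks, descending
DUMP E;   -- (1,Asha,80) (2,Bala,45) (3,Chitra,65)           : all rows with duplicates removed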

12. How will you query the data in Hive? Illustrate with data definition, data manipulation,
and data selection queries. (Nov/Dec 2024)
Key Points:
 Data Definition (DDL):
CREATE TABLE employee (id INT, name STRING, salary FLOAT);
 Data Manipulation (DML):
LOAD DATA INPATH '/data/emp.csv' INTO TABLE employee;
 Data Selection (Querying):
SELECT * FROM employee;
SELECT name, salary FROM employee WHERE salary > 30000;
SELECT AVG(salary) FROM employee;
 Hive Query Processing:
o Queries internally translated into MapReduce jobs or Tez/Spark jobs.
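The generated execution plan can be inspected with Hive's EXPLAIN statement, for example against the employee table above:

EXPLAIN SELECT name, salary FROM employee WHERE salary > 30000;
-- Prints the stage plan (MapReduce, Tez, or Spark tasks) that Hive will run for the query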

13.i) Explain the architecture of Apache HBase and its components with a neat diagram.
(Nov/Dec 2024)
Key Points:
 Overview:
o Column-oriented NoSQL database on top of HDFS
o Designed for real-time read/write access to large datasets.
 Components:
o HMaster: Coordinates region servers
o Region Server: Manages regions (subsets of table)
o Zookeeper: Coordinates cluster health and metadata
o HFile: Actual file storage in HDFS
o MemStore + WAL (Write-Ahead Log): in-memory write buffer, plus a log that records each
write for durability before data is flushed to HFiles
 Diagram:
o Include client → Zookeeper → HMaster → RegionServer → HFile flow.

13.ii) With a suitable example, discuss meta-store concepts in Hive and storage mechanisms
in HBase. (Nov/Dec 2024)
Key Points:
 Hive Metastore:
o Central repository storing metadata: table schema, partitioning info, etc.
o Backed by RDBMS (MySQL, Derby)
 Example:
o Table student(id, name) stored in /user/hive/warehouse/student/
o Metadata about the table is stored in the Metastore (see the DESCRIBE FORMATTED sketch after this list)
 HBase Storage Mechanisms:
o Data stored in HFiles (column-wise), indexed by row key
o Supports sparse data, write-optimized
 Comparison:
o Hive → batch analytics
o HBase → real-time NoSQL access
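The metadata the Metastore holds for the student example above can be inspected from the Hive shell:

DESCRIBE FORMATTED student;
-- Shows the columns, table type, HDFS location (e.g., /user/hive/warehouse/student/),
-- input/output formats, and other properties read from the Metastore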
14. Explain the detailed process of installing Apache Pig on a Hadoop environment. What
are the prerequisites, steps, and verification methods involved in Pig installation? (Apr/May
2025)
Key Points:
 Prerequisites:
o Java installed
o Hadoop installed and running
o Environment variables set (JAVA_HOME, HADOOP_HOME)
 Installation Steps:
1. Download Pig from Apache site
2. Extract Pig
3. Set PIG_HOME and update PATH (a sample environment setup is sketched after this list)
4. Link Pig with Hadoop
 Verification:
o Run Pig in local or mapreduce mode:
pig -x local
or
pig -x mapreduce
 Run Sample Script to verify successful execution
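A minimal sketch of the environment-variable setup, typically added to ~/.bashrc (all paths below are assumptions and depend on where the JDK, Hadoop, and Pig were installed):

export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64    # assumed JDK location
export HADOOP_HOME=/opt/hadoop                        # assumed Hadoop installation directory
export PIG_HOME=/opt/pig-0.17.0                       # assumed Pig installation directory
export PATH=$PATH:$PIG_HOME/bin
export PIG_CLASSPATH=$HADOOP_HOME/etc/hadoop          # lets Pig pick up the Hadoop configuration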

15. Provide examples of typical use cases for HBase and illustrate how its data model
supports them. (Apr/May 2025)
Key Points:
 Use Cases:
1. Time-series data: IoT sensors, server logs
2. Real-time analytics: Clickstream, fraud detection
3. Large-scale key-value store: Social media, messaging apps
4. Metadata storage: For Hadoop jobs or big data platforms
 Data Model Support:
o Row key: fast lookups
o Column family: logical data separation
o Cell versioning: supports historical data
o Schema flexibility
 Example:
o IoT system: row key = sensor ID, column = timestamp:value
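An HBase shell sketch of this IoT example (the table name, column family, row keys, and values are assumptions):

create 'sensor_data', 'readings'                                 # table with one column family
put 'sensor_data', 'sensor42', 'readings:20250101T1200', '21.5'
put 'sensor_data', 'sensor42', 'readings:20250101T1205', '21.7'
get 'sensor_data', 'sensor42'                                    # returns all timestamped readings for sensor42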
