Unit 3
Q1. Explain the different MapReduce types and formats.
Ans :- In Hadoop MapReduce, types and formats govern how data enters and leaves a job. In terms of types, the map function takes (K1, V1) pairs and emits intermediate (K2, V2) pairs, while the reduce function takes (K2, list(V2)) and emits (K3, V3) pairs. Input and output formats then define how these key-value pairs are read from and written to storage. Here is an explanation of the different formats:
1. Input Types/Formats
Input formats define how the input data is split and read by the
framework.
a. TextInputFormat (default)
Treats each line of the input file as a record.
Key: byte offset of the line within the file (LongWritable)
Value: contents of the line (Text)
b. KeyValueTextInputFormat
Splits each line into a key and a value based on a separator
(tab by default).
Useful for structured key-value text files.
Key: part of the line before the separator
Value: remaining part of the line
c. SequenceFileInputFormat
Reads Hadoop SequenceFiles, a binary, Hadoop-specific format.
Key and value are both Writable types.
d. NLineInputFormat
Splits the input so that each mapper receives a fixed number of
lines (e.g., 5 lines per split).
e. DBInputFormat
Reads data from a relational database over JDBC.
Used for integration between Hadoop and an RDBMS.
2. Output Types/Formats
Output formats define how the results of the MapReduce job are
written to storage.
a. TextOutputFormat (default)
Writes key-value pairs as plain text.
Key and value are separated by a tab.
b. SequenceFileOutputFormat
Writes binary output as a Hadoop SequenceFile.
c. NullOutputFormat
Used when no output is required.
Typically used for debugging, or when the output is written to
a different system directly by the tasks.
d. MultipleOutputs
A helper class that allows a job to write data to different files
or formats depending on the logic.
3. Custom Input and Output Formats
Developers can create their own formats by extending
FileInputFormat or FileOutputFormat for specific use cases
like XML, JSON, logs, etc.
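These formats are configured on the Job object in the driver program. Below is a minimal, hedged sketch using the Hadoop Java API; the job name and HDFS paths are illustrative, and the mapper/reducer classes are omitted for brevity.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

public class FormatConfigExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "format-config-example");
        job.setJarByClass(FormatConfigExample.class);

        // Read each line as key (text before the tab) and value (text after it)
        job.setInputFormatClass(KeyValueTextInputFormat.class);
        // Write the results as a binary Hadoop SequenceFile instead of plain text
        job.setOutputFormatClass(SequenceFileOutputFormat.class);

        // Types of the final output key-value pairs
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);

        // Illustrative HDFS paths; setMapperClass/setReducerClass omitted for brevity
        FileInputFormat.addInputPath(job, new Path("/user/demo/input"));
        FileOutputFormat.setOutputPath(job, new Path("/user/demo/output"));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}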
Q2. What is MapReduce? How does it facilitate
distributed data processing?
Ans :-
MapReduce is a programming model and processing technique
for distributed processing of large datasets across clusters of
computers. It was introduced by Google and is widely
implemented in Apache Hadoop.
It breaks down a task into two major functions:
1. Map Function – Processes input data and produces key-
value pairs.
2. Reduce Function – Merges and aggregates the values
based on keys.
How MapReduce Facilitates Distributed Data Processing
1.Data Splitting:
a. The input data is divided into input splits, typically the
size of an HDFS block (commonly 128 MB or 256 MB).
b. Each split is processed independently and in parallel.
2.Parallel Execution:
a. Multiple mapper tasks process different data blocks
simultaneously on different nodes.
b. After mapping, intermediate key-value pairs are
shuffled and sorted, then sent to reducer tasks.
3.Fault Tolerance:
a. Hadoop automatically handles node failures by
rerunning tasks on other nodes.
4.Data Locality:
a. MapReduce runs computation on the same node where
data is stored (HDFS), minimizing data transfer.
5.Scalability:
a. It can scale out to thousands of commodity servers for
handling petabytes of data.
6.Simplified Programming:
a. Developers only write map() and reduce() functions;
Hadoop handles parallelization, data distribution, and
fault recovery.
Workflow of MapReduce
1. Input Split → Data is divided into input splits.
2. Map Phase → Each block is processed to produce key-
value pairs.
3. Shuffle and Sort → Groups all values with the same key.
4. Reduce Phase → Aggregates or processes the grouped
data.
5. Output → Final results are written to HDFS.
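To make the map() and reduce() functions concrete, here is a minimal, hedged word-count sketch using the Hadoop Java API; the class names are illustrative.
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

    // Map: for every word in the input line, emit the pair (word, 1)
    public static class TokenizerMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            for (String token : line.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);   // intermediate key-value pair
                }
            }
        }
    }

    // Reduce: sum the counts received for each word
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable c : counts) {
                sum += c.get();
            }
            context.write(word, new IntWritable(sum));   // final (word, total)
        }
    }
}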
Q3. Explain in details the characteristics of MongoDB.
Ans:-
MongoDB is a popular NoSQL database designed for high
performance, high availability, and easy scalability. It stores
data in a flexible, JSON-like format called BSON (Binary JSON).
Here are the key characteristics of MongoDB:
1. Document-Oriented
MongoDB stores data as documents in collections.
Documents are similar to JSON objects but stored in BSON.
Each document can have a different structure, allowing
flexible schema.
Example:
{
  "_id": 1,
  "name": "Kunal",
  "age": 25,
  "skills": ["MongoDB", "Node.js"]
}
2. Schema-Less (Flexible Schema)
Unlike traditional RDBMS, MongoDB does not require a
predefined schema.
You can insert different fields in different documents of the
same collection.
This makes it easier to evolve applications.
3. High Performance
MongoDB is optimized for read and write operations.
It supports indexing, in-memory computing, and data
replication, improving performance.
4. Horizontal Scalability
MongoDB supports sharding, which is a method of
distributing data across multiple machines.
It enables horizontal scaling by partitioning large datasets
across many servers.
5. Rich Query Language
MongoDB supports a powerful and expressive query
language.
You can perform CRUD operations, filtering, sorting,
aggregation, text search, etc.
Example:
db.users.find({ age: { $gt: 18 } }).sort({ name: 1 });
6. Indexing
Supports various types of indexes like single field,
compound, geospatial, text, and hashed indexes.
Indexing improves the performance of search queries.
7. Replication
MongoDB supports replica sets for data redundancy and
high availability.
A replica set has a primary node and multiple secondary
nodes for automatic failover.
8. Aggregation Framework
MongoDB has a built-in aggregation pipeline for data
analysis.
It allows transforming and combining data using stages like
$match, $group, $sort, $project, etc.
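Example (a hedged sketch using the MongoDB Java driver; the users collection and the age and city fields are assumptions):
import java.util.Arrays;
import org.bson.Document;
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.Accumulators;
import com.mongodb.client.model.Aggregates;
import com.mongodb.client.model.Filters;
import com.mongodb.client.model.Sorts;

public class AggregationSketch {
    public static void main(String[] args) {
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
            MongoCollection<Document> users =
                client.getDatabase("test").getCollection("users");

            // $match -> $group -> $sort pipeline: count adult users per city
            users.aggregate(Arrays.asList(
                Aggregates.match(Filters.gte("age", 18)),
                Aggregates.group("$city", Accumulators.sum("count", 1)),
                Aggregates.sort(Sorts.descending("count"))
            )).forEach(doc -> System.out.println(doc.toJson()));
        }
    }
}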
9. File Storage – GridFS
For storing large files (like images, videos), MongoDB
provides GridFS, which splits files into chunks and stores
them as documents.
10. ACID Transactions
MongoDB supports multi-document ACID transactions
(since version 4.0), ensuring reliable and consistent
operations similar to relational databases.
11. Open Source and Cross-Platform
MongoDB is open source and works on multiple platforms
like Windows, Linux, and macOS.
12. Integration with Modern Technologies
Easily integrates with Node.js, Python, Java, and other
technologies.
Also supports cloud deployment through MongoDB Atlas.
Q4. Explain the main components of the MapReduce
framework and how do they interact with each other?
Ans :-
Main Components of the MapReduce Framework
1.JobTracker (in Hadoop 1) / ResourceManager (in
Hadoop 2 YARN)
a. JobTracker (Hadoop 1):
i. Manages resources and job scheduling.
ii. Tracks progress of MapReduce jobs.
b. ResourceManager (YARN):
i. Manages global resources and schedules jobs
(through ApplicationMaster).
2.TaskTracker (in Hadoop 1) / NodeManager (in Hadoop
2 YARN)
a. TaskTracker (Hadoop 1):
i. Executes tasks assigned by the JobTracker.
b. NodeManager (YARN):
i. Manages resources and task execution on a single
node.
3.InputFormat
a. Defines how input data is split into InputSplits (logical
chunks).
b. Provides a RecordReader to convert these splits into
key-value pairs for the mapper.
4.Mapper
a. Processes each input key-value pair and produces
intermediate key-value pairs.
b. Runs in parallel across multiple nodes.
5.Partitioner
a. Divides the intermediate data by key (based on hash or
custom logic).
b. Ensures that all values for a given key are sent to the same
reducer (see the sketch after this list).
6.Shuffle and Sort
a. Shuffle: Transfers intermediate data from mappers to
reducers.
b. Sort: Sorts the data by key before sending to reducers.
7.Reducer
a. Aggregates or otherwise processes the sorted intermediate data.
b. Produces the final output.
8.OutputFormat
a. Defines how final output key-value pairs are written to
HDFS or another storage.
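As a concrete illustration of the Partitioner component (item 5 above), here is a minimal, hedged sketch of a custom partitioner using the Hadoop Java API; the class name and the partitioning rule are illustrative.
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Routes keys to reducers by their first character, so that all keys
// starting with the same letter reach the same reducer.
public class FirstLetterPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numReduceTasks) {
        String k = key.toString();
        if (numReduceTasks == 0 || k.isEmpty()) {
            return 0;
        }
        return Character.toLowerCase(k.charAt(0)) % numReduceTasks;
    }
}
// Registered in the driver with: job.setPartitionerClass(FirstLetterPartitioner.class);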
Interaction Between Components
Here’s a step-by-step view of how these components work
together:
1.Job Submission
a. The client submits the job to the JobTracker (Hadoop
1) or ResourceManager (Hadoop 2).
2.Input Splitting
a. InputFormat splits the input data into InputSplits.
3.Task Assignment
a. JobTracker/ResourceManager assigns splits to
TaskTrackers/NodeManagers.
4.Map Phase
a. Each mapper processes its split (using the
RecordReader).
b. Produces intermediate key-value pairs (stored locally).
5.Partitioning
a. The Partitioner decides how to distribute intermediate
data to reducers.
6.Shuffle and Sort
a. Data is shuffled (transferred) to reducers.
b. Each reducer sorts the data by key.
7.Reduce Phase
a. Each reducer processes sorted intermediate data.
b. Produces the final output key-value pairs.
8.Output Writing
a. OutputFormat writes the final output to HDFS or other
storage.
Q5. Explain the role of the Job Tracker in MapReduce and
how does it coordinate job execution?
Ans:-
Role of the JobTracker:-
The JobTracker is the master daemon of the Hadoop MapReduce
framework (used in Hadoop 1.x). It is responsible for:
Accepting and managing MapReduce jobs submitted by
clients.
Splitting each job into map and reduce tasks.
Scheduling tasks to be executed by TaskTrackers (the worker
daemons).
Monitoring task progress and handling failures.
Coordinating the overall job execution.
How the JobTracker Coordinates Job Execution
Let’s go step by step:
1. Job Submission
The client submits a job to the JobTracker.
JobTracker receives job configuration (e.g., input path,
mapper/reducer classes).
2. Input Splitting
The job client uses the InputFormat to split the input data
into InputSplits.
The JobTracker creates one map task per InputSplit.
3. Scheduling Tasks
JobTracker assigns map tasks to TaskTrackers based on:
o Data locality (processes data on nodes where it
resides).
o Load balancing across nodes.
4. Task Monitoring
JobTracker periodically communicates with TaskTrackers
via heartbeat signals.
It tracks the progress and status of each map and reduce
task.
If a TaskTracker fails, JobTracker reassigns the task to
another TaskTracker.
5. Shuffle and Reduce Coordination
After map tasks finish, JobTracker:
o Coordinates the shuffle phase (data transfer from
mappers to reducers).
o Assigns reduce tasks to TaskTrackers.
6. Handling Failures
If a task fails, JobTracker detects the failure (via missing
heartbeats or error messages).
It restarts the task on another healthy TaskTracker.
7. Job Completion
Once all map and reduce tasks finish successfully, the
JobTracker marks the job as completed.
It informs the client; the final output, written by the reduce
tasks, is available in HDFS.
Q6. What is MongoDB? Explain the need of MongoDB?
Ans:-
MongoDB is an open-source NoSQL database. It is a
document-oriented database, which means it stores data in a
flexible, JSON-like format called BSON (Binary JSON).
Key Points:
Developed by MongoDB Inc.
Written in C++.
Uses collections (like tables in SQL) and documents (like
rows in SQL).
Supports dynamic schema — you don’t need to define the
schema beforehand.
Known for high performance, horizontal scalability, and
easy data modeling for modern applications.
Need of MongoDB (Why Use MongoDB?)
1.Handling Unstructured or Semi-Structured Data
a. Traditional relational databases are good for structured
data (tables, rows, columns).
b. Modern applications (social media, IoT, analytics)
generate semi-structured or unstructured data.
c. MongoDB handles such data efficiently with flexible
document schemas.
2.Dynamic Schema
a. In MongoDB, documents in the same collection can
have different fields.
b. No need to define a schema upfront, which makes it easy
to change data structures as applications evolve (see the
sketch at the end of this answer).
3.Horizontal Scalability
a. MongoDB supports sharding, enabling horizontal
scaling by distributing data across multiple servers.
b. Useful for handling huge volumes of data (big data).
4.High Performance
a. MongoDB supports indexing, in-memory computing,
and replication.
b. This ensures fast reads and writes even with large
datasets.
5.Better Support for Modern Applications
a. MongoDB works well with JSON and RESTful APIs.
b. Ideal for web applications, real-time analytics, IoT, and
more.
6.Built-in Replication and High Availability
a. MongoDB supports replica sets for automatic failover
and data redundancy.
b. Ensures continuous availability of applications.
7.Aggregation Framework for Analytics
a. MongoDB has a powerful aggregation pipeline for data
analysis and reporting, making it suitable for analytical
workloads.
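As a hedged sketch of the dynamic-schema point (item 2 above), the following MongoDB Java driver snippet inserts two documents with different fields into the same collection; the database, collection, and field names are assumptions.
import org.bson.Document;
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;

public class FlexibleSchemaSketch {
    public static void main(String[] args) {
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
            MongoCollection<Document> users =
                client.getDatabase("test").getCollection("users");

            // Two documents in the same collection with different fields:
            users.insertOne(new Document("name", "Kunal").append("age", 25));
            users.insertOne(new Document("name", "Asha")
                    .append("email", "asha@example.com")
                    .append("skills", java.util.Arrays.asList("MongoDB", "Java")));
        }
    }
}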
Unit 4
Q1. What is Pig Latin? How does Pig Latin handle data
transformation, filtering and aggregation operations?
Pig Latin is a high-level data flow language used in Apache
Pig, a platform for analyzing large datasets in Hadoop.
It’s designed to simplify writing MapReduce jobs.
It’s a procedural language that allows you to describe data
transformations step by step.
The code written in Pig Latin is converted into MapReduce
jobs by the Pig execution engine.
How Pig Latin Handles Data Transformation, Filtering,
and Aggregation
Pig Latin uses a set of built-in operations to work with data. Let’s
break them down:
1. Data Transformation
Pig Latin supports various transformations like:
LOAD – Load data from a source (like HDFS or local file).
FOREACH … GENERATE – Apply expressions to each
record (e.g., select columns, apply functions).
FILTER – Select records based on conditions.
SPLIT – Split data into multiple datasets based on
conditions.
JOIN – Combine datasets based on keys.
GROUP – Group data by keys.
ORDER – Sort data.
2. Data Filtering
Pig Latin uses the FILTER operator to remove unwanted
records.
Example:
data = LOAD 'file.txt' AS (name:chararray, age:int);
filtered_data = FILTER data BY age > 18;
This filters out records where age is less than or equal to 18.
3. Data Aggregation
Pig Latin supports aggregation functions like:
COUNT()
SUM()
AVG()
MIN()
MAX()
Aggregations are usually performed after GROUP.
Putting It All Together
Here’s a typical Pig Latin script flow:
-- Load data
A = LOAD 'input.txt' AS (name:chararray, age:int, salary:int);
-- Filter data
B = FILTER A BY age > 25;
-- Transform data
C = FOREACH B GENERATE name, salary*1.1 AS new_salary;
-- Group data
D = GROUP C BY name;
-- Aggregate data
E = FOREACH D GENERATE group, SUM(C.new_salary);
-- Store result
STORE E INTO 'output';
Q2. Explain the following terms : i) Hive Shell ii)Hive
Metastore
Ans:-
i) Hive Shell
The Hive Shell is the primary command-line interface (CLI)
provided by Apache Hive, which enables users to interact with
the Hive system and perform various operations on data stored
in the Hadoop Distributed File System (HDFS). The Hive Shell
allows users to execute Hive Query Language (HQL) commands
in an interactive session, making it easier to explore,
manipulate, and query large datasets.
Key features of the Hive Shell:
It is typically launched by typing the command hive in a
terminal window.
Once inside the Hive Shell, users can issue commands to
create databases and tables, load data into tables, and
perform data analysis through HQL queries.
Users can also manage data partitions and view metadata
about tables.
The Hive Shell supports standard SQL-like commands,
providing a familiar interface for users coming from a
relational database background.
It can be used for testing and debugging queries before
deploying them in production environments.
Although it primarily operates in interactive mode, it also
allows users to run HQL scripts by passing the script file as
an argument to the Hive Shell command.
The Hive Shell is a critical tool for administrators, analysts, and
data engineers working with Hive, as it offers direct access to
the Hive execution engine and simplifies working with large-
scale data in a Hadoop ecosystem.
ii) Hive Metastore
The Hive Metastore is an essential component of Apache Hive,
responsible for storing and managing metadata about the Hive
data structures. It acts as a centralized catalog, providing
detailed information about the databases, tables, columns, data
types, and partitions used within Hive.
Key characteristics and functions of the Hive Metastore:
The Metastore typically uses a relational database such as
MySQL, PostgreSQL, or Apache Derby to store metadata in a
structured format.
It stores the schema details of Hive tables (table names,
columns, data types, and storage formats).
It keeps track of where the actual data resides in the
Hadoop Distributed File System (HDFS), including table
locations and file formats (like ORC, Parquet, or Text).
It also maintains information about partitions, which are
subsets of data within a table based on specific column
values (useful for optimizing queries).
The Hive Metastore provides APIs that allow other Hadoop
ecosystem tools (like Apache Spark, Presto, or Impala) to
access this metadata, making it a central component for
data discovery and processing.
The Metastore supports two modes: embedded (where the
database runs in the same process as Hive) and remote
(where the Metastore service runs as a separate process,
allowing multiple clients to access it concurrently).
Example of metadata stored in the Hive Metastore for a
table:
Table name: employees
Columns: id (int), name (string), department (string), salary
(float)
Location: /user/hive/warehouse/employees
Partition keys: department
The Hive Metastore plays a vital role in ensuring data
consistency and helping Hive and other tools access data
efficiently. It simplifies data management in a distributed
environment by acting as a single source of truth for metadata.
Q3. Differentiate between HBase and Relational
Database Management System
Ans:-
The main differences between an RDBMS and HBase are:
1. Query Language: RDBMS uses SQL (Structured Query Language); HBase is a NoSQL (non-relational) database.
2. Schema: RDBMS has a fixed schema; HBase has a dynamic schema.
3. Database Type: RDBMS stores structured data; HBase stores unstructured or semi-structured data.
4. Scalability: RDBMS scales vertically, meaning that when more memory, processing power, or disk space is required, the existing server is upgraded to a more capable one rather than adding new servers. HBase scales horizontally, meaning that when extra memory or disk space is required, new servers are added to the cluster rather than upgrading the existing ones.
5. Nature: RDBMS is static in nature; HBase is dynamic in nature.
6. Data Retrieval: Data retrieval is slower in RDBMS and faster in HBase.
7. Rule: RDBMS follows the ACID (Atomicity, Consistency, Isolation, Durability) properties; HBase follows the CAP (Consistency, Availability, Partition tolerance) theorem.
8. Sparse Data: RDBMS cannot handle sparse data; HBase can handle sparse data.
9. Volume of Data: In RDBMS, the amount of data is determined by the server's configuration; in HBase, it depends on the number of machines deployed rather than on a single machine.
10. Transaction Integrity: RDBMS generally provides a guarantee of transaction integrity; HBase provides no such guarantee.
11. Referential Integrity: RDBMS supports referential integrity; HBase has no built-in support for it.
12. Normalization: In RDBMS, data can be normalized; in HBase, data is not normalized, meaning there is no logical relationship or connection between distinct tables of data.
13. Setup Complexity: RDBMS is simple to design; HBase is more complex and requires the Hadoop ecosystem.
14. Use Cases: RDBMS suits transaction-heavy applications; HBase suits big data real-time analytics.
15. Data Partitioning: RDBMS has no automatic partitioning, although some systems support data sharding; HBase partitions data automatically.
16. Backup & Recovery: RDBMS provides native backup and recovery options; in HBase, the backup and recovery mechanism is more complex and depends on the underlying Hadoop infrastructure.
Q4. What is PIG in Hadoop Eco System? Explain the
different execution modes of PIG.
Ans:-
Apache Pig is a high-level data flow scripting language and
platform built on top of Hadoop. It was designed to simplify the
process of writing MapReduce programs by providing an easy-
to-understand scripting language called Pig Latin, which
resembles SQL but is more procedural. Pig allows data analysts
and developers to express data transformations, such as
filtering, grouping, joining, and sorting, in a concise way without
needing to write complex Java MapReduce code.
Pig scripts are automatically compiled into a series of
MapReduce jobs that run on a Hadoop cluster, abstracting the
complexity of distributed processing. Pig supports both
structured and semi-structured data, making it a versatile tool in
the big data ecosystem.
Different Execution Modes of PIG
1.Local Mode
a. In Local Mode, Pig runs on a single machine using the
local file system instead of Hadoop Distributed File
System (HDFS).
b. This mode does not require a Hadoop cluster; it is
useful for small datasets, development, and debugging
purposes.
c. Execution is limited by the local machine’s resources,
so it is not suitable for large-scale data processing.
d. The Pig script processes files located on the local
filesystem.
2.MapReduce Mode (Hadoop Mode)
a. In MapReduce Mode, Pig runs on a fully distributed
Hadoop cluster.
b. The Pig Latin scripts are compiled into one or more
MapReduce jobs that are executed across the cluster in
parallel, leveraging Hadoop’s scalability and fault
tolerance.
c. Data input and output occur in HDFS, enabling efficient
processing of large datasets distributed over multiple
nodes.
d. This mode is suitable for production environments
where big data is processed at scale.
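The mode is normally selected when launching Pig (for example, pig -x local or pig -x mapreduce). When Pig is embedded in a Java program, the same choice is made through the PigServer API; below is a minimal, hedged sketch in which the script statements and file paths are illustrative.
import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class PigModeSketch {
    public static void main(String[] args) throws Exception {
        // ExecType.LOCAL runs against the local file system;
        // ExecType.MAPREDUCE compiles the script into MapReduce jobs on the cluster.
        PigServer pig = new PigServer(ExecType.LOCAL);

        pig.registerQuery("A = LOAD 'input.txt' AS (name:chararray, age:int);");
        pig.registerQuery("B = FILTER A BY age > 25;");
        pig.store("B", "output");   // writes the filtered relation to 'output'

        pig.shutdown();
    }
}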
Q5. Explain the main components of Apache Hive and
how do they interact with each other?
Ans:-
Main Components of Apache Hive:
1.Hive Driver:
a. Acts as the controller that receives the HiveQL queries
from the user.
b. Manages the lifecycle of a HiveQL query by compiling,
optimizing, and executing it.
c. Sends the query execution plan to the execution
engine.
2.Compiler:
a. Converts HiveQL queries into execution plans.
b. Performs parsing, semantic analysis, and query
optimization.
c. Translates queries into a Directed Acyclic Graph (DAG)
of MapReduce jobs or other execution tasks.
3.Execution Engine:
a. Executes the tasks as per the compiled execution plan.
b. Manages task scheduling and monitors job execution
on the Hadoop cluster.
c. Interacts with Hadoop’s MapReduce or Tez or Spark
frameworks to process data.
4.Metastore:
a. A centralized repository storing metadata about Hive
tables, partitions, schemas, and data types.
b. Maintains information about the location of data on
HDFS and table definitions.
c. Supports metadata querying and is accessible by
different components during query compilation and
execution.
5.Driver Interface / CLI / UI:
a. Interfaces through which users submit HiveQL queries.
b. Can be command-line interface (CLI), web UI, or
JDBC/ODBC drivers for external applications.
How These Components Interact:
The user submits a HiveQL query via the CLI or a driver
interface.
The Hive Driver receives the query and forwards it to the
Compiler.
The Compiler parses the query, checks syntax and
semantics, and uses the Metastore to retrieve metadata.
The compiler then generates an optimized execution plan,
often a series of MapReduce jobs or DAGs for other engines.
The execution plan is sent to the Execution Engine, which
interacts with Hadoop’s processing framework to execute
the jobs.
As the jobs run, the Execution Engine tracks progress and
collects results.
Once execution is complete, results are returned to the user
through the driver interface.
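To illustrate the JDBC/ODBC access path mentioned above, here is a minimal, hedged sketch that submits a HiveQL query to HiveServer2 over JDBC; the connection URL, credentials, and the employees table are assumptions.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveJdbcSketch {
    public static void main(String[] args) throws Exception {
        // HiveServer2 JDBC driver
        Class.forName("org.apache.hive.jdbc.HiveDriver");

        try (Connection conn = DriverManager.getConnection(
                 "jdbc:hive2://localhost:10000/default", "hiveuser", "");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(
                 "SELECT department, COUNT(*) AS emp_count " +
                 "FROM employees GROUP BY department")) {
            while (rs.next()) {
                System.out.println(rs.getString("department")
                        + " -> " + rs.getLong("emp_count"));
            }
        }
    }
}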
Q6. Explain the key features and characteristics of HBase
that differentiate it from traditional relational databases.
Ans:-
Key Features and Characteristics of HBase:
1. NoSQL, Column-Oriented Storage:
HBase is a NoSQL database that stores data in a column-
oriented format rather than rows, allowing efficient storage
and retrieval of sparse data.
2. Schema Flexibility:
HBase has a flexible schema design where columns can be
added dynamically without altering existing data, unlike the
fixed schema of relational databases.
3. Horizontal Scalability:
HBase is designed to scale out by distributing data across
many commodity servers, supporting massive datasets and
high throughput.
4. Built on Hadoop HDFS:
It stores data on the Hadoop Distributed File System
(HDFS), benefiting from fault tolerance, replication, and
scalability inherent to Hadoop.
5. Strong Consistency at Row Level:
HBase guarantees strong consistency for read/write
operations on individual rows but does not support multi-
row transactions.
6. Sparse Data Storage:
It efficiently stores sparse data where many columns can
be empty, saving storage space compared to traditional
relational databases.
7. No Support for Joins and Foreign Keys:
HBase does not support relational operations like joins or
foreign key constraints; data is typically denormalized for
performance.
8. Real-Time Read/Write Access:
HBase supports fast, random, real-time read and write
access to large datasets, making it suitable for time-
sensitive applications.
9. Data Versioning:
HBase maintains multiple versions of data for each cell,
allowing historical data retrieval and audit capabilities.
10. API-Driven Access:
Data access is via APIs such as Java, REST, or Thrift, rather
than SQL, though some SQL-like querying is possible
through layers like Apache Phoenix.
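As a hedged illustration of the API-driven access described in point 10, here is a minimal sketch using the HBase Java client; the users table, the info column family, and the row key are assumptions, and the table is assumed to already exist.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseClientSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("users"))) {

            // Write one cell: row "user1", column family "info", qualifier "name"
            Put put = new Put(Bytes.toBytes("user1"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("Kunal"));
            table.put(put);

            // Random, real-time read of the same cell
            Get get = new Get(Bytes.toBytes("user1"));
            Result result = table.get(get);
            byte[] name = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"));
            System.out.println("name = " + Bytes.toString(name));
        }
    }
}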