Unit 3
Q1. Explain the different MapReduce types and formats.
Ans :- In Hadoop MapReduce, types and formats govern how data enters and leaves a job. In terms of types, the map function takes (K1, V1) pairs and emits intermediate (K2, V2) pairs, while the reduce function takes (K2, list(V2)) and emits (K3, V3) pairs. Input and output formats then define how these key-value pairs are read from and written to storage. Here is an explanation of the different formats:
1. Input Types/Formats
Input formats define how the input data is split and read by the
framework.
a. TextInputFormat (default)
Treats each line of the input file as a record.
Key: byte offset of the line within the file (LongWritable)
Value: contents of the line (Text)
b. KeyValueTextInputFormat
Splits each line into a key and a value based on a separator
(tab by default).
Useful for structured key-value text files.
Key: part of the line before the separator
Value: remaining part of the line
c. SequenceFileInputFormat
Reads Hadoop SequenceFiles, a binary, Hadoop-specific format.
Key and value are both Writable types.
d. NLineInputFormat
Splits the input so that each mapper receives a fixed number of
lines (e.g., 5 lines per split).
e. DBInputFormat
Reads data from a relational database over JDBC.
Used for integration between Hadoop and an RDBMS.
2. Output Types/Formats
Output formats define how the results of the MapReduce job are
written to storage.
a. TextOutputFormat (default)
Writes key-value pairs as plain text.
Key and value are separated by a tab.
b. SequenceFileOutputFormat
Writes binary output as a Hadoop SequenceFile.
c. NullOutputFormat
Used when no output is required.
Typically used for debugging, or when the output is written to
a different system directly by the tasks.
d. MultipleOutputs
A helper class that allows a job to write data to different files
or formats depending on the logic.
3. Custom Input and Output Formats
Developers can create their own formats by extending
FileInputFormat or FileOutputFormat for specific use cases
like XML, JSON, logs, etc.
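These formats are configured on the Job object in the driver program. Below is a minimal, hedged sketch using the Hadoop Java API; the job name and HDFS paths are illustrative, and the mapper/reducer classes are omitted for brevity.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

public class FormatConfigExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "format-config-example");
        job.setJarByClass(FormatConfigExample.class);

        // Read each line as key (text before the tab) and value (text after it)
        job.setInputFormatClass(KeyValueTextInputFormat.class);
        // Write the results as a binary Hadoop SequenceFile instead of plain text
        job.setOutputFormatClass(SequenceFileOutputFormat.class);

        // Types of the final output key-value pairs
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);

        // Illustrative HDFS paths; setMapperClass/setReducerClass omitted for brevity
        FileInputFormat.addInputPath(job, new Path("/user/demo/input"));
        FileOutputFormat.setOutputPath(job, new Path("/user/demo/output"));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}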
Q2. What is MapReduce? How does it facilitate
distributed data processing?
Ans :-
MapReduce is a programming model and processing technique
for distributed processing of large datasets across clusters of
computers. It was introduced by Google and is widely
implemented in Apache Hadoop.
It breaks down a task into two major functions:
1. Map Function – Processes input data and produces key-
value pairs.
2. Reduce Function – Merges and aggregates the values
based on keys.
How MapReduce Facilitates Distributed Data Processing
1.Data Splitting:
a. The input data is divided into input splits, typically the
size of an HDFS block (commonly 128 MB or 256 MB).
b. Each split is processed independently and in parallel.
2.Parallel Execution:
a. Multiple mapper tasks process different data blocks
simultaneously on different nodes.
b. After mapping, intermediate key-value pairs are
shuffled and sorted, then sent to reducer tasks.
3.Fault Tolerance:
a. Hadoop automatically handles node failures by
rerunning tasks on other nodes.
4.Data Locality:
a. MapReduce runs computation on the same node where
data is stored (HDFS), minimizing data transfer.
5.Scalability:
a. It can scale out to thousands of commodity servers for
handling petabytes of data.
6.Simplified Programming:
a. Developers only write map() and reduce() functions;
Hadoop handles parallelization, data distribution, and
fault recovery.
Workflow of MapReduce
1. Input Split → Data is divided into input splits.
2. Map Phase → Each block is processed to produce key-
value pairs.
3. Shuffle and Sort → Groups all values with the same key.
4. Reduce Phase → Aggregates or processes the grouped
data.
5. Output → Final results are written to HDFS.
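To make the map() and reduce() functions concrete, here is a minimal, hedged word-count sketch using the Hadoop Java API; the class names are illustrative.
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

    // Map: for every word in the input line, emit the pair (word, 1)
    public static class TokenizerMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            for (String token : line.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);   // intermediate key-value pair
                }
            }
        }
    }

    // Reduce: sum the counts received for each word
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable c : counts) {
                sum += c.get();
            }
            context.write(word, new IntWritable(sum));   // final (word, total)
        }
    }
}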
Q3. Explain in details the characteristics of MongoDB.
Ans:-
MongoDB is a popular NoSQL database designed for high
performance, high availability, and easy scalability. It stores
data in a flexible, JSON-like format called BSON (Binary JSON).
Here are the key characteristics of MongoDB:
1. Document-Oriented
MongoDB stores data as documents in collections.
Documents are similar to JSON objects but stored in BSON.
Each document can have a different structure, allowing
flexible schema.
Example:
{
  "_id": 1,
  "name": "Kunal",
  "age": 25,
  "skills": ["MongoDB", "Node.js"]
}
2. Schema-Less (Flexible Schema)
Unlike traditional RDBMS, MongoDB does not require a
predefined schema.
You can insert different fields in different documents of the
same collection.
This makes it easier to evolve applications.
3. High Performance
MongoDB is optimized for read and write operations.
It supports indexing, in-memory computing, and data
replication, improving performance.
4. Horizontal Scalability
MongoDB supports sharding, which is a method of
distributing data across multiple machines.
It enables horizontal scaling by partitioning large datasets
across many servers.
5. Rich Query Language
MongoDB supports a powerful and expressive query
language.
You can perform CRUD operations, filtering, sorting,
aggregation, text search, etc.
Example:
db.users.find({ age: { $gt: 18 } }).sort({ name: 1 });
6. Indexing
Supports various types of indexes like single field,
compound, geospatial, text, and hashed indexes.
Indexing improves the performance of search queries.
7. Replication
MongoDB supports replica sets for data redundancy and
high availability.
A replica set has a primary node and multiple secondary
nodes for automatic failover.
8. Aggregation Framework
MongoDB has a built-in aggregation pipeline for data
analysis.
It allows transforming and combining data using stages like
$match, $group, $sort, $project, etc.
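Example (a hedged sketch using the MongoDB Java driver; the users collection and the age and city fields are assumptions):
import java.util.Arrays;
import org.bson.Document;
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.Accumulators;
import com.mongodb.client.model.Aggregates;
import com.mongodb.client.model.Filters;
import com.mongodb.client.model.Sorts;

public class AggregationSketch {
    public static void main(String[] args) {
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
            MongoCollection<Document> users =
                client.getDatabase("test").getCollection("users");

            // $match -> $group -> $sort pipeline: count adult users per city
            users.aggregate(Arrays.asList(
                Aggregates.match(Filters.gte("age", 18)),
                Aggregates.group("$city", Accumulators.sum("count", 1)),
                Aggregates.sort(Sorts.descending("count"))
            )).forEach(doc -> System.out.println(doc.toJson()));
        }
    }
}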
9. File Storage – GridFS
For storing large files (like images, videos), MongoDB
provides GridFS, which splits files into chunks and stores
them as documents.
10. ACID Transactions
MongoDB supports multi-document ACID transactions
(since version 4.0), ensuring reliable and consistent
operations similar to relational databases.
11. Open Source and Cross-Platform
MongoDB is open source and works on multiple platforms
like Windows, Linux, and macOS.
12. Integration with Modern Technologies
Easily integrates with Node.js, Python, Java, and other
technologies.
Also supports cloud deployment through MongoDB Atlas.
Q4. Explain the main components of the MapReduce
framework and how do they interact with each other?
Ans :-
Main Components of the MapReduce Framework
1.JobTracker (in Hadoop 1) / ResourceManager (in
Hadoop 2 YARN)
a. JobTracker (Hadoop 1):
i. Manages resources and job scheduling.
ii. Tracks progress of MapReduce jobs.
b. ResourceManager (YARN):
i. Manages global resources and schedules jobs
(through ApplicationMaster).
2.TaskTracker (in Hadoop 1) / NodeManager (in Hadoop
2 YARN)
a. TaskTracker (Hadoop 1):
i. Executes tasks assigned by the JobTracker.
b. NodeManager (YARN):
i. Manages resources and task execution on a single
node.
3.InputFormat
a. Defines how input data is split into InputSplits (logical
chunks).
b. Provides a RecordReader to convert these splits into
key-value pairs for the mapper.
4.Mapper
a. Processes each input key-value pair and produces
intermediate key-value pairs.
b. Runs in parallel across multiple nodes.
5.Partitioner
a. Divides the intermediate data by key (based on hash or
custom logic).
b. Ensures that all values for a given key are sent to the same
reducer (see the sketch after this list).
6.Shuffle and Sort
a. Shuffle: Transfers intermediate data from mappers to
reducers.
b. Sort: Sorts the data by key before sending to reducers.
7.Reducer
a. Aggregates or otherwise processes the sorted intermediate data.
b. Produces the final output.
8.OutputFormat
a. Defines how final output key-value pairs are written to
HDFS or another storage.
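As a concrete illustration of the Partitioner component (item 5 above), here is a minimal, hedged sketch of a custom partitioner using the Hadoop Java API; the class name and the partitioning rule are illustrative.
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Routes keys to reducers by their first character, so that all keys
// starting with the same letter reach the same reducer.
public class FirstLetterPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numReduceTasks) {
        String k = key.toString();
        if (numReduceTasks == 0 || k.isEmpty()) {
            return 0;
        }
        return Character.toLowerCase(k.charAt(0)) % numReduceTasks;
    }
}
// Registered in the driver with: job.setPartitionerClass(FirstLetterPartitioner.class);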
Interaction Between Components
Here’s a step-by-step view of how these components work
together:
1.Job Submission
a. The client submits the job to the JobTracker (Hadoop
1) or ResourceManager (Hadoop 2).
2.Input Splitting
a. InputFormat splits the input data into InputSplits.
3.Task Assignment
a. JobTracker/ResourceManager assigns splits to
TaskTrackers/NodeManagers.
4.Map Phase
a. Each mapper processes its split (using the
RecordReader).
b. Produces intermediate key-value pairs (stored locally).
5.Partitioning
a. The Partitioner decides how to distribute intermediate
data to reducers.
6.Shuffle and Sort
a. Data is shuffled (transferred) to reducers.
b. Each reducer sorts the data by key.
7.Reduce Phase
a. Each reducer processes sorted intermediate data.
b. Produces the final output key-value pairs.
8.Output Writing
a. OutputFormat writes the final output to HDFS or other
storage.
Q5. Explain the role of the Job Tracker in MapReduce and
how does it coordinate job execution?
Ans:-
Role of the JobTracker:-
The JobTracker is the master daemon of the Hadoop MapReduce
framework (used in Hadoop 1.x). It is responsible for:
Accepting and managing MapReduce jobs submitted by
clients.
Splitting each job into map and reduce tasks.
Scheduling tasks to be executed by TaskTrackers (the worker
daemons).
Monitoring task progress and handling failures.
Coordinating the overall job execution.
How the JobTracker Coordinates Job Execution
Let’s go step by step:
1. Job Submission
The client submits a job to the JobTracker.
JobTracker receives job configuration (e.g., input path,
mapper/reducer classes).
2. Input Splitting
The job client uses the InputFormat to split the input data
into InputSplits.
The JobTracker creates one map task per InputSplit.
3. Scheduling Tasks
JobTracker assigns map tasks to TaskTrackers based on:
o Data locality (processes data on nodes where it
resides).
o Load balancing across nodes.
4. Task Monitoring
JobTracker periodically communicates with TaskTrackers
via heartbeat signals.
It tracks the progress and status of each map and reduce
task.
If a TaskTracker fails, JobTracker reassigns the task to
another TaskTracker.
5. Shuffle and Reduce Coordination
After map tasks finish, JobTracker:
o Coordinates the shuffle phase (data transfer from
mappers to reducers).
o Assigns reduce tasks to TaskTrackers.
6. Handling Failures
If a task fails, JobTracker detects the failure (via missing
heartbeats or error messages).
It restarts the task on another healthy TaskTracker.
7. Job Completion
Once all map and reduce tasks finish successfully, the
JobTracker marks the job as completed.
It informs the client; the final output, written by the reduce
tasks, is available in HDFS.
Q6. What is MongoDB? Explain the need of MongoDB?
Ans:-
MongoDB is an open-source NoSQL database. It is a
document-oriented database, which means it stores data in a
flexible, JSON-like format called BSON (Binary JSON).
Key Points:
Developed by MongoDB Inc.
Written in C++.
Uses collections (like tables in SQL) and documents (like
rows in SQL).
Supports dynamic schema — you don’t need to define the
schema beforehand.
Known for high performance, horizontal scalability, and
easy data modeling for modern applications.
Need of MongoDB (Why Use MongoDB?)
1.Handling Unstructured or Semi-Structured Data
a. Traditional relational databases are good for structured
data (tables, rows, columns).
b. Modern applications (social media, IoT, analytics)
generate semi-structured or unstructured data.
c. MongoDB handles such data efficiently with flexible
document schemas.
2.Dynamic Schema
a. In MongoDB, documents in the same collection can
have different fields.
b. No need to define a schema upfront, which makes it easy
to change data structures as applications evolve (see the
sketch at the end of this answer).
3.Horizontal Scalability
a. MongoDB supports sharding, enabling horizontal
scaling by distributing data across multiple servers.
b. Useful for handling huge volumes of data (big data).
4.High Performance
a. MongoDB supports indexing, in-memory computing,
and replication.
b. This ensures fast reads and writes even with large
datasets.
5.Better Support for Modern Applications
a. MongoDB works well with JSON and RESTful APIs.
b. Ideal for web applications, real-time analytics, IoT, and
more.
6.Built-in Replication and High Availability
a. MongoDB supports replica sets for automatic failover
and data redundancy.
b. Ensures continuous availability of applications.
7.Aggregation Framework for Analytics
a. MongoDB has a powerful aggregation pipeline for data
analysis and reporting, making it suitable for analytical
workloads.
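As a hedged sketch of the dynamic-schema point (item 2 above), the following MongoDB Java driver snippet inserts two documents with different fields into the same collection; the database, collection, and field names are assumptions.
import org.bson.Document;
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;

public class FlexibleSchemaSketch {
    public static void main(String[] args) {
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
            MongoCollection<Document> users =
                client.getDatabase("test").getCollection("users");

            // Two documents in the same collection with different fields:
            users.insertOne(new Document("name", "Kunal").append("age", 25));
            users.insertOne(new Document("name", "Asha")
                    .append("email", "asha@example.com")
                    .append("skills", java.util.Arrays.asList("MongoDB", "Java")));
        }
    }
}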
Unit 4
Q1. What is Pig Latin? How does Pig Latin handle data
transformation, filtering and aggregation operations?
Pig Latin is a high-level data flow language used in Apache
Pig, a platform for analyzing large datasets in Hadoop.
It’s designed to simplify writing MapReduce jobs.
It’s a procedural language that allows you to describe data
transformations step by step.
The code written in Pig Latin is converted into MapReduce
jobs by the Pig execution engine.
How Pig Latin Handles Data Transformation, Filtering,
and Aggregation
Pig Latin uses a set of built-in operations to work with data. Let’s
break them down:
1. Data Transformation
Pig Latin supports various transformations like:
LOAD – Load data from a source (like HDFS or local file).
FOREACH … GENERATE – Apply expressions to each
record (e.g., select columns, apply functions).
FILTER – Select records based on conditions.
SPLIT – Split data into multiple datasets based on
conditions.
JOIN – Combine datasets based on keys.
GROUP – Group data by keys.
ORDER – Sort data.
2. Data Filtering
Pig Latin uses the FILTER operator to remove unwanted
records.
Example:
data = LOAD 'file.txt' AS (name:chararray, age:int);
filtered_data = FILTER data BY age > 18;
This filters out records where age is less than or equal to 18.
3. Data Aggregation
Pig Latin supports aggregation functions like:
COUNT()
SUM()
AVG()
MIN()
MAX()
Aggregations are usually performed after GROUP.
Putting It All Together
Here’s a typical Pig Latin script flow:
-- Load data
A = LOAD 'input.txt' AS (name:chararray, age:int, salary:int);
-- Filter data
B = FILTER A BY age > 25;
-- Transform data
C = FOREACH B GENERATE name, salary*1.1 AS new_salary;
-- Group data
D = GROUP C BY name;
-- Aggregate data
E = FOREACH D GENERATE group, SUM(C.new_salary);
-- Store result
STORE E INTO 'output';
Q2. Explain the following terms : i) Hive Shell ii)Hive
Metastore
Ans:-
i) Hive Shell
The Hive Shell is the primary command-line interface (CLI)
provided by Apache Hive, which enables users to interact with
the Hive system and perform various operations on data stored
in the Hadoop Distributed File System (HDFS). The Hive Shell
allows users to execute Hive Query Language (HQL) commands
in an interactive session, making it easier to explore,
manipulate, and query large datasets.
Key features of the Hive Shell:
It is typically launched by typing the command hive in a
terminal window.
Once inside the Hive Shell, users can issue commands to
create databases and tables, load data into tables, and
perform data analysis through HQL queries.
Users can also manage data partitions and view metadata
about tables.
The Hive Shell supports standard SQL-like commands,
providing a familiar interface for users coming from a
relational database background.
It can be used for testing and debugging queries before
deploying them in production environments.
Although it primarily operates in interactive mode, it also
allows users to run HQL scripts by passing the script file as
an argument to the Hive Shell command.
The Hive Shell is a critical tool for administrators, analysts, and
data engineers working with Hive, as it offers direct access to
the Hive execution engine and simplifies working with large-
scale data in a Hadoop ecosystem.
ii) Hive Metastore
The Hive Metastore is an essential component of Apache Hive,
responsible for storing and managing metadata about the Hive
data structures. It acts as a centralized catalog, providing
detailed information about the databases, tables, columns, data
types, and partitions used within Hive.
Key characteristics and functions of the Hive Metastore:
The Metastore typically uses a relational database such as
MySQL, PostgreSQL, or Apache Derby to store metadata in a
structured format.
It stores the schema details of Hive tables (table names,
columns, data types, and storage formats).
It keeps track of where the actual data resides in the
Hadoop Distributed File System (HDFS), including table
locations and file formats (like ORC, Parquet, or Text).
It also maintains information about partitions, which are
subsets of data within a table based on specific column
values (useful for optimizing queries).
The Hive Metastore provides APIs that allow other Hadoop
ecosystem tools (like Apache Spark, Presto, or Impala) to
access this metadata, making it a central component for
data discovery and processing.
The Metastore supports two modes: embedded (where the
database runs in the same process as Hive) and remote
(where the Metastore service runs as a separate process,
allowing multiple clients to access it concurrently).
Example of metadata stored in the Hive Metastore for a
table:
Table name: employees
Columns: id (int), name (string), department (string), salary
(float)
Location: /user/hive/warehouse/employees
Partition keys: department
The Hive Metastore plays a vital role in ensuring data
consistency and helping Hive and other tools access data
efficiently. It simplifies data management in a distributed
environment by acting as a single source of truth for metadata.
Q3. Differentiate between HBase and Relational
Database Management System
Ans:-
The main differences between an RDBMS and HBase are:
1. Query Language: RDBMS uses SQL (Structured Query Language); HBase is a NoSQL (non-relational) database.
2. Schema: RDBMS has a fixed schema; HBase has a dynamic schema.
3. Database Type: RDBMS stores structured data; HBase stores unstructured or semi-structured data.
4. Scalability: RDBMS scales vertically, meaning that when more memory, processing power, or disk space is required, the existing server is upgraded to a more capable one rather than adding new servers. HBase scales horizontally, meaning that when extra memory or disk space is required, new servers are added to the cluster rather than upgrading the existing ones.
5. Nature: RDBMS is static in nature; HBase is dynamic in nature.
6. Data Retrieval: Data retrieval is slower in RDBMS and faster in HBase.
7. Rule: RDBMS follows the ACID (Atomicity, Consistency, Isolation, Durability) properties; HBase follows the CAP (Consistency, Availability, Partition tolerance) theorem.
8. Sparse Data: RDBMS cannot handle sparse data; HBase can handle sparse data.
9. Volume of Data: In RDBMS, the amount of data is determined by the server's configuration; in HBase, it depends on the number of machines deployed rather than on a single machine.
10. Transaction Integrity: RDBMS generally provides a guarantee of transaction integrity; HBase provides no such guarantee.
11. Referential Integrity: RDBMS supports referential integrity; HBase has no built-in support for it.
12. Normalization: In RDBMS, data can be normalized; in HBase, data is not normalized, meaning there is no logical relationship or connection between distinct tables of data.
13. Setup Complexity: RDBMS is simple to design; HBase is more complex and requires the Hadoop ecosystem.
14. Use Cases: RDBMS suits transaction-heavy applications; HBase suits big data real-time analytics.
15. Data Partitioning: RDBMS has no automatic partitioning, although some systems support data sharding; HBase partitions data automatically.
16. Backup & Recovery: RDBMS provides native backup and recovery options; in HBase, the backup and recovery mechanism is more complex and depends on the underlying Hadoop infrastructure.
Q4. What is PIG in Hadoop Eco System? Explain the
different execution modes of PIG.
Ans:-
Apache Pig is a high-level data flow scripting language and
platform built on top of Hadoop. It was designed to simplify the
process of writing MapReduce programs by providing an easy-
to-understand scripting language called Pig Latin, which
resembles SQL but is more procedural. Pig allows data analysts
and developers to express data transformations, such as
filtering, grouping, joining, and sorting, in a concise way without
needing to write complex Java MapReduce code.
Pig scripts are automatically compiled into a series of
MapReduce jobs that run on a Hadoop cluster, abstracting the
complexity of distributed processing. Pig supports both
structured and semi-structured data, making it a versatile tool in
the big data ecosystem.
Different Execution Modes of PIG
1.Local Mode
a. In Local Mode, Pig runs on a single machine using the
local file system instead of Hadoop Distributed File
System (HDFS).
b. This mode does not require a Hadoop cluster; it is
useful for small datasets, development, and debugging
purposes.
c. Execution is limited by the local machine’s resources,
so it is not suitable for large-scale data processing.
d. The Pig script processes files located on the local
filesystem.
2.MapReduce Mode (Hadoop Mode)
a. In MapReduce Mode, Pig runs on a fully distributed
Hadoop cluster.
b. The Pig Latin scripts are compiled into one or more
MapReduce jobs that are executed across the cluster in
parallel, leveraging Hadoop’s scalability and fault
tolerance.
c. Data input and output occur in HDFS, enabling efficient
processing of large datasets distributed over multiple
nodes.
d. This mode is suitable for production environments
where big data is processed at scale.
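The mode is normally selected when launching Pig (for example, pig -x local or pig -x mapreduce). When Pig is embedded in a Java program, the same choice is made through the PigServer API; below is a minimal, hedged sketch in which the script statements and file paths are illustrative.
import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class PigModeSketch {
    public static void main(String[] args) throws Exception {
        // ExecType.LOCAL runs against the local file system;
        // ExecType.MAPREDUCE compiles the script into MapReduce jobs on the cluster.
        PigServer pig = new PigServer(ExecType.LOCAL);

        pig.registerQuery("A = LOAD 'input.txt' AS (name:chararray, age:int);");
        pig.registerQuery("B = FILTER A BY age > 25;");
        pig.store("B", "output");   // writes the filtered relation to 'output'

        pig.shutdown();
    }
}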
Q5. Explain the main components of Apache Hive and
how do they interact with each other?
Ans:-
Main Components of Apache Hive:
1.Hive Driver:
a. Acts as the controller that receives the HiveQL queries
from the user.
b. Manages the lifecycle of a HiveQL query by compiling,
optimizing, and executing it.
c. Sends the query execution plan to the execution
engine.
2.Compiler:
a. Converts HiveQL queries into execution plans.
b. Performs parsing, semantic analysis, and query
optimization.
c. Translates queries into a Directed Acyclic Graph (DAG)
of MapReduce jobs or other execution tasks.
3.Execution Engine:
a. Executes the tasks as per the compiled execution plan.
b. Manages task scheduling and monitors job execution
on the Hadoop cluster.
c. Interacts with Hadoop’s MapReduce or Tez or Spark
frameworks to process data.
4.Metastore:
a. A centralized repository storing metadata about Hive
tables, partitions, schemas, and data types.
b. Maintains information about the location of data on
HDFS and table definitions.
c. Supports metadata querying and is accessible by
different components during query compilation and
execution.
5.Driver Interface / CLI / UI:
a. Interfaces through which users submit HiveQL queries.
b. Can be command-line interface (CLI), web UI, or
JDBC/ODBC drivers for external applications.
How These Components Interact:
The user submits a HiveQL query via the CLI or a driver
interface.
The Hive Driver receives the query and forwards it to the
Compiler.
The Compiler parses the query, checks syntax and
semantics, and uses the Metastore to retrieve metadata.
The compiler then generates an optimized execution plan,
often a series of MapReduce jobs or DAGs for other engines.
The execution plan is sent to the Execution Engine, which
interacts with Hadoop’s processing framework to execute
the jobs.
As the jobs run, the Execution Engine tracks progress and
collects results.
Once execution is complete, results are returned to the user
through the driver interface.
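To illustrate the JDBC/ODBC access path mentioned above, here is a minimal, hedged sketch that submits a HiveQL query to HiveServer2 over JDBC; the connection URL, credentials, and the employees table are assumptions.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveJdbcSketch {
    public static void main(String[] args) throws Exception {
        // HiveServer2 JDBC driver
        Class.forName("org.apache.hive.jdbc.HiveDriver");

        try (Connection conn = DriverManager.getConnection(
                 "jdbc:hive2://localhost:10000/default", "hiveuser", "");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(
                 "SELECT department, COUNT(*) AS emp_count " +
                 "FROM employees GROUP BY department")) {
            while (rs.next()) {
                System.out.println(rs.getString("department")
                        + " -> " + rs.getLong("emp_count"));
            }
        }
    }
}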
Q6. Explain the key features and characteristics of HBase
that differentiate it from traditional relational databases.
Ans:-
Key Features and Characteristics of HBase:
1. NoSQL, Column-Oriented Storage:
HBase is a NoSQL database that stores data in a column-
oriented format rather than rows, allowing efficient storage
and retrieval of sparse data.
2. Schema Flexibility:
HBase has a flexible schema design where columns can be
added dynamically without altering existing data, unlike the
fixed schema of relational databases.
3. Horizontal Scalability:
HBase is designed to scale out by distributing data across
many commodity servers, supporting massive datasets and
high throughput.
4. Built on Hadoop HDFS:
It stores data on the Hadoop Distributed File System
(HDFS), benefiting from fault tolerance, replication, and
scalability inherent to Hadoop.
5. Strong Consistency at Row Level:
HBase guarantees strong consistency for read/write
operations on individual rows but does not support multi-
row transactions.
6. Sparse Data Storage:
It efficiently stores sparse data where many columns can
be empty, saving storage space compared to traditional
relational databases.
7. No Support for Joins and Foreign Keys:
HBase does not support relational operations like joins or
foreign key constraints; data is typically denormalized for
performance.
8. Real-Time Read/Write Access:
HBase supports fast, random, real-time read and write
access to large datasets, making it suitable for time-
sensitive applications.
9. Data Versioning:
HBase maintains multiple versions of data for each cell,
allowing historical data retrieval and audit capabilities.
10. API-Driven Access:
Data access is via APIs such as Java, REST, or Thrift, rather
than SQL, though some SQL-like querying is possible
through layers like Apache Phoenix.
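As a hedged illustration of the API-driven access described in point 10, here is a minimal sketch using the HBase Java client; the users table, the info column family, and the row key are assumptions, and the table is assumed to already exist.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseClientSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("users"))) {

            // Write one cell: row "user1", column family "info", qualifier "name"
            Put put = new Put(Bytes.toBytes("user1"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("Kunal"));
            table.put(put);

            // Random, real-time read of the same cell
            Get get = new Get(Bytes.toBytes("user1"));
            Result result = table.get(get);
            byte[] name = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"));
            System.out.println("name = " + Bytes.toString(name));
        }
    }
}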